using CSV # for reading in delimited files
using DataFrames # for storing data as type DataFrame
using Haversine # for calculating Great Circle (haversine) distances between sites
using Statistics # for mean() function
using MultivariateStats # for Principal Coordinates Analysis (multidimensional scaling)
using DelimitedFiles # for reading delimited files (the genotypic data)
using Impute # for imputing missing genotypes
using JLD2 # for saving data
using CairoMakie # for plots
using PrettyTables
CairoMakie.activate!() # this makes CairoMakie the main package for figures (in case another already loaded)Greenish Warbler Genomic Analysis
This page contains notes and code describing the data analysis for a manuscript on Greenish Warbler genomics. I’ve been working with the data for several years, and the R and then Julia code has been in development for a while. This is a Quarto notebook, which can run and display the results of Julia (or other) code blocks, along with text narration, and output in html, pdf, Word, etc.
The Julia code here is loosely based on R code written for Greenish Warbler analysis (Irwin et al. 2016), and then the North American warbler analyses (Irwin et al. 2019), and then my (unpublished) 2019 Greenish Warbler analysis. Most recently, this was adapted from the scripts called GW2022_R_analysis_script.R and IrwinLabGenomicsAnalysisScript.jl but has had a lot of opimizations since then. The SNP data here are a result of GBS reads mapped to our new 2022 Biozeron genome assembly for a greenish warbler from southern China.
Load packages
If running this for the first time, you will need to load packages used in the script, so run what is in this section below. It will take some time to install and precompile the packages:
import Pkg; Pkg.add("CSV") # took less than a minute
Pkg.add("DataFrames") # took about a minute
Pkg.add("Plots") # seems to install and working more simply than Makie (but less powerful)
Pkg.add("Haversine") # for great circle (Haversine) distances
Pkg.add("Distributions") # this seemed to fix a problem installing GLMakie
Pkg.add("MultivariateStats")
Pkg.add("StatsBase")
Pkg.add("Impute")
Pkg.add("JLD2")
Pkg.add("CairoMakie")
Pkg.add("PrettyTables") # for printing nice tables to REPL
Pkg.add("GenomicDiversity")Now actually load those packages into the Julia session:
Load my custom package GenomicDiversity:
using GenomicDiversity # actually make GenomicDiversity module available with GenomicDiversity.functionName(),
# or if functions are exported from GenomicDiversity then they are available.Test Julia:
x = 1; y = 2; z = x+y
println("z = ", z)z = 3
(If Quarto is calling Julia properly, you will see z = 3 as the output of the code block above.)
Choose working directory:
repoDirectory = pwd() # this gets the starting working directory, for later use
dataDirectory = "/Users/darrenirwin/Dropbox/Darren's current work/"
cd(dataDirectory)Load the chromosome (scaffold) lengths
The code below will load a fasta index file for the reference genome, and then make a dictionary (a look-up table) where the key is the scaffold name and value is the chromosome length. This will be used in the genotype-by-individual figures.
cd(repoDirectory)
scaffold_info_filepath = "metadata/GW2022ref.fa.fai" # a fasta index file for the reference genome
scaffold_info = DataFrame(CSV.File(scaffold_info_filepath; header=["name", "length", "offset", "linebases", "linewidth"], types=[String, Int, Int, Int, Int]))
scaffold_lengths = Dict(scaffold_info.name .=> scaffold_info.length)
cd(dataDirectory)Set up file directories and names
# choose path and filename for the 012NA files
baseName = "GW_genomics_2022_with_new_genome/GW2022_GBS_012NA_files/GW2022_all4plates.genotypes.SNPs_only.whole_genome"
filenameTextMiddle = ".max2allele_noindel.vcf.maxmiss"
# indicate percent threshold for missing genotypes for each SNP--
# this was set by earlier filtering, and is just a record-keeper for the filenames:
missingGenotypeThreshold = 60
filenameTextEnd = ".MQ20.lowHet.tab"
tagName = ".Jan2025." # choose a tag name for this analysis
# indicate name of metadata file, a text file with these column headings:
# ID location group Fst_group plot_order
metadataFile = "GW_genomics_2022_with_new_genome/GW_all4plates.Fst_groups.txt""GW_genomics_2022_with_new_genome/GW_all4plates.Fst_groups.txt"
List the scaffolds to be imputed (for later inclusion in PCA)
chromosomes_to_process = vec(["gw2",
"gw1",
"gw3",
"gwZ",
"gw1A",
"gw4",
"gw5",
"gw7",
"gw6",
"gw8",
"gw9",
"gw11",
"gw12",
"gw10",
"gw13",
"gw14",
"gw18",
"gw20",
"gw15",
"gw1B",
"gws100",
"gw17",
"gw19",
"gws101",
"gw4A",
"gw21",
"gw26",
"gws102",
"gw23",
"gw25",
"gws103",
"gw22",
"gws104",
"gw28",
"gw27",
"gw24",
"gws105",
"gws106",
"gws107",
"gws108",
"gws109",
"gws110",
"gws112"]);Set up function to correct the names of GBS runs:
function correctNames(metadataColumn)
metadataColumn_corrected = replace(metadataColumn,
"GW_Armando_plate1_TTGW05_rep2" => "GW_Armando_plate1_TTGW05r2",
"GW_Lane5_NA3-3ul" => "GW_Lane5_NA3",
"GW_Armando_plate1_TTGW_15_05" => "GW_Armando_plate1_TTGW-15-05",
"GW_Armando_plate1_TTGW_15_07" => "GW_Armando_plate1_TTGW-15-07",
"GW_Armando_plate1_TTGW_15_08" => "GW_Armando_plate1_TTGW-15-08",
"GW_Armando_plate1_TTGW_15_09" => "GW_Armando_plate1_TTGW-15-09",
"GW_Armando_plate1_TTGW_15_01" => "GW_Armando_plate1_TTGW-15-01",
"GW_Armando_plate1_TTGW_15_02" => "GW_Armando_plate1_TTGW-15-02",
"GW_Armando_plate1_TTGW_15_03" => "GW_Armando_plate1_TTGW-15-03",
"GW_Armando_plate1_TTGW_15_04" => "GW_Armando_plate1_TTGW-15-04",
"GW_Armando_plate1_TTGW_15_06" => "GW_Armando_plate1_TTGW-15-06",
"GW_Armando_plate1_TTGW_15_10" => "GW_Armando_plate1_TTGW-15-10",
"GW_Armando_plate2_TTGW_15_01" => "GW_Armando_plate2_TTGW-15-01",
"GW_Armando_plate2_TTGW_15_02" => "GW_Armando_plate2_TTGW-15-02",
"GW_Armando_plate2_TTGW_15_03" => "GW_Armando_plate2_TTGW-15-03",
"GW_Armando_plate2_TTGW_15_04" => "GW_Armando_plate2_TTGW-15-04",
"GW_Armando_plate2_TTGW_15_06" => "GW_Armando_plate2_TTGW-15-06",
"GW_Armando_plate2_TTGW_15_10" => "GW_Armando_plate2_TTGW-15-10")
endcorrectNames (generic function with 1 method)
The above should be run before any of the other Quarto pages are run
OK, let’s load the genomic data!
# load metadata
metadata = DataFrame(CSV.File(metadataFile)) # the CSV.File function interprets the correct delimiter
num_metadata_cols = ncol(metadata)
num_individuals = nrow(metadata)
# read in individual names for this dataset
individuals_file_name = string(baseName, filenameTextMiddle, missingGenotypeThreshold, filenameTextEnd, ".012.indv")
ind = DataFrame(CSV.File(individuals_file_name; header=["ind"], types=[String]))
indNum = size(ind, 1) # number of individuals
if num_individuals != indNum
println("WARNING: number of rows in metadata file different than number of individuals in .indv file")
end
# read in position data for this dataset
position_file_name = string(baseName, filenameTextMiddle, missingGenotypeThreshold, filenameTextEnd, ".012.pos")
pos_whole_genome = DataFrame(CSV.File(position_file_name; header=["chrom", "position"], types=[String, Int]))
# read in genotype data
genotype_file_name = string(baseName, filenameTextMiddle, missingGenotypeThreshold, filenameTextEnd, ".012minus1")
@time if 1 <= indNum <= 127
geno = readdlm(genotype_file_name, '\t', Int8, '\n'); # this has been sped up dramatically, by first converting "NA" to -1
elseif 128 <= indNum <= 32767
geno = readdlm(genotype_file_name, '\t', Int16, '\n'); # this needed for first column, which is number of individual; Int16 not much slower on import than Int8
else
print("Error: Number of individuals in .indv appears outside of range from 1 to 32767")
end
loci_count = size(geno, 2) - 1 # because the first column is not a SNP (just a count from zero)
print(string("Read in genotypic data at ", loci_count," loci for ", indNum, " individuals. \n")) 72.650708 seconds (4.50 M allocations: 16.001 GiB, 2.10% gc time, 1.06% compilation time)
Read in genotypic data at 2431709 loci for 310 individuals.
Check that individuals are same in genotype data and metadata
ind_with_metadata = hcat(ind, metadata)
println(ind_with_metadata)
println() # prints a line break
if isequal(ind_with_metadata.ind, ind_with_metadata.ID)
println("GOOD NEWS: names of individuals in metadata file and genotype ind file match perfectly.")
else
println("WARNING: names of individuals in metadata file and genotype ind file do not completely match.")
end310×6 DataFrame
Row │ ind ID location group Fst_group plot_order
│ String String31 String7 String15 String15 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ GW_Armando_plate1_AB1 GW_Armando_plate1_AB1 AB vir vir 20.01
2 │ GW_Armando_plate1_JF07G02 GW_Armando_plate1_JF07G02 ST plumb plumb 108.0
3 │ GW_Armando_plate1_JF07G03 GW_Armando_plate1_JF07G03 ST plumb plumb 109.0
4 │ GW_Armando_plate1_JF07G04 GW_Armando_plate1_JF07G04 ST plumb plumb 110.0
5 │ GW_Armando_plate1_JF08G02 GW_Armando_plate1_JF08G02 ST plumb plumb 111.0
6 │ GW_Armando_plate1_JF09G01 GW_Armando_plate1_JF09G01 ST plumb plumb 112.0
7 │ GW_Armando_plate1_JF09G02 GW_Armando_plate1_JF09G02 ST plumb plumb 113.0
8 │ GW_Armando_plate1_JF10G03 GW_Armando_plate1_JF10G03 ST plumb_vir plumb_vir 170.0
9 │ GW_Armando_plate1_JF11G01 GW_Armando_plate1_JF11G01 ST plumb plumb 114.0
10 │ GW_Armando_plate1_JF12G01 GW_Armando_plate1_JF12G01 ST plumb plumb 115.0
11 │ GW_Armando_plate1_JF12G02 GW_Armando_plate1_JF12G02 ST plumb plumb 116.0
12 │ GW_Armando_plate1_JF12G04 GW_Armando_plate1_JF12G04 ST_vi vir vir 24.001
13 │ GW_Armando_plate1_JF13G01 GW_Armando_plate1_JF13G01 ST plumb plumb 117.0
14 │ GW_Armando_plate1_JF15G03 GW_Armando_plate1_JF15G03 DV plumb plumb 103.0
15 │ GW_Armando_plate1_JF16G01 GW_Armando_plate1_JF16G01 DV_vi plumb_vir vir 24.041
16 │ GW_Armando_plate1_JF20G01 GW_Armando_plate1_JF20G01 MB plumb plumb 94.0
17 │ GW_Armando_plate1_JF22G01 GW_Armando_plate1_JF22G01 MB plumb plumb 95.0
18 │ GW_Armando_plate1_JF23G01 GW_Armando_plate1_JF23G01 VB plumb plumb 98.0
19 │ GW_Armando_plate1_JF23G02 GW_Armando_plate1_JF23G02 VB plumb plumb 99.0
20 │ GW_Armando_plate1_JF24G02 GW_Armando_plate1_JF24G02 VB plumb plumb 100.0
21 │ GW_Armando_plate1_JF26G01 GW_Armando_plate1_JF26G01 ST plumb plumb 118.0
22 │ GW_Armando_plate1_JF27G01 GW_Armando_plate1_JF27G01 ST plumb plumb 119.0
23 │ GW_Armando_plate1_JF29G01 GW_Armando_plate1_JF29G01 ST plumb plumb 120.0
24 │ GW_Armando_plate1_JF29G02 GW_Armando_plate1_JF29G02 ST plumb plumb 121.0
25 │ GW_Armando_plate1_JF29G03 GW_Armando_plate1_JF29G03 ST plumb plumb 122.0
26 │ GW_Armando_plate1_JG02G02 GW_Armando_plate1_JG02G02 PR plumb plumb 145.0
27 │ GW_Armando_plate1_JG02G04 GW_Armando_plate1_JG02G04 PR plumb plumb 146.0
28 │ GW_Armando_plate1_JG08G01 GW_Armando_plate1_JG08G01 ST plumb plumb 123.0
29 │ GW_Armando_plate1_JG08G02 GW_Armando_plate1_JG08G02 ST plumb plumb 124.0
30 │ GW_Armando_plate1_JG10G01 GW_Armando_plate1_JG10G01 ST plumb plumb 125.0
31 │ GW_Armando_plate1_JG12G01 GW_Armando_plate1_JG12G01 ST plumb plumb 126.0
32 │ GW_Armando_plate1_JG17G01 GW_Armando_plate1_JG17G01 ST plumb_vir plumb 127.0
33 │ GW_Armando_plate1_NO_BC_TTGW05 GW_Armando_plate1_NO_BC_TTGW05 blank blank blank -99.0
34 │ GW_Armando_plate1_NO_DNA GW_Armando_plate1_NO_DNA blank blank blank -99.0
35 │ GW_Armando_plate1_RF20G01 GW_Armando_plate1_RF20G01 BJ obs_plumb plumb_BJ 77.501
36 │ GW_Armando_plate1_RF29G02 GW_Armando_plate1_RF29G02 BJ obs_plumb plumb_BJ 77.502
37 │ GW_Armando_plate1_TL3 GW_Armando_plate1_TL3 TL vir vir 11.01
38 │ GW_Armando_plate1_TTGW01 GW_Armando_plate1_TTGW01 MN troch_MN troch_west 53.0
39 │ GW_Armando_plate1_TTGW05_rep1 GW_Armando_plate1_TTGW05_rep1 MN_rep troch_MN_rep troch_west_rep 53.0
40 │ GW_Armando_plate1_TTGW05_rep2 GW_Armando_plate1_TTGW05_rep2 MN troch_MN troch_west 53.0
41 │ GW_Armando_plate1_TTGW06 GW_Armando_plate1_TTGW06 SU lud_Sukhto lud_central 47.0
42 │ GW_Armando_plate1_TTGW07 GW_Armando_plate1_TTGW07 SU lud_Sukhto lud_central 47.0
43 │ GW_Armando_plate1_TTGW10 GW_Armando_plate1_TTGW10 SU lud_Sukhto lud_central 47.0
44 │ GW_Armando_plate1_TTGW11 GW_Armando_plate1_TTGW11 SU lud_Sukhto lud_central 47.0
45 │ GW_Armando_plate1_TTGW13 GW_Armando_plate1_TTGW13 TH lud_Thallighar lud_central 43.0
46 │ GW_Armando_plate1_TTGW17 GW_Armando_plate1_TTGW17 TH lud_Thallighar lud_central 43.0
47 │ GW_Armando_plate1_TTGW19 GW_Armando_plate1_TTGW19 TH lud_Thallighar lud_central 43.0
48 │ GW_Armando_plate1_TTGW21 GW_Armando_plate1_TTGW21 SR lud_Sural lud_central 45.0
49 │ GW_Armando_plate1_TTGW22 GW_Armando_plate1_TTGW22 SR lud_Sural lud_central 45.0
50 │ GW_Armando_plate1_TTGW23 GW_Armando_plate1_TTGW23 SR lud_Sural lud_central 45.0
51 │ GW_Armando_plate1_TTGW29 GW_Armando_plate1_TTGW29 SR lud_Sural lud_central 45.0
52 │ GW_Armando_plate1_TTGW52 GW_Armando_plate1_TTGW52 NG lud_Nainaghar lud_central 49.0
53 │ GW_Armando_plate1_TTGW53 GW_Armando_plate1_TTGW53 NG lud_Nainaghar lud_central 49.0
54 │ GW_Armando_plate1_TTGW55 GW_Armando_plate1_TTGW55 NG lud_Nainaghar lud_central 49.0
55 │ GW_Armando_plate1_TTGW57 GW_Armando_plate1_TTGW57 NG lud_Nainaghar lud_central 49.0
56 │ GW_Armando_plate1_TTGW58 GW_Armando_plate1_TTGW58 NG lud_Nainaghar lud_central 49.0
57 │ GW_Armando_plate1_TTGW59 GW_Armando_plate1_TTGW59 NG lud_Nainaghar lud_central 49.0
58 │ GW_Armando_plate1_TTGW63 GW_Armando_plate1_TTGW63 SP lud_Spiti troch_west 55.0
59 │ GW_Armando_plate1_TTGW64 GW_Armando_plate1_TTGW64 SP lud_Spiti troch_west 55.0
60 │ GW_Armando_plate1_TTGW65 GW_Armando_plate1_TTGW65 SP lud_Spiti troch_west 55.0
61 │ GW_Armando_plate1_TTGW66 GW_Armando_plate1_TTGW66 SP lud_Spiti troch_west 55.0
62 │ GW_Armando_plate1_TTGW68 GW_Armando_plate1_TTGW68 SP lud_Spiti troch_west 55.0
63 │ GW_Armando_plate1_TTGW70 GW_Armando_plate1_TTGW70 SA lud_Sathrundi lud_Sath 41.0
64 │ GW_Armando_plate1_TTGW71 GW_Armando_plate1_TTGW71 SA lud_Sathrundi lud_Sath 41.0
65 │ GW_Armando_plate1_TTGW72 GW_Armando_plate1_TTGW72 SA lud_Sathrundi lud_Sath 41.0
66 │ GW_Armando_plate1_TTGW74 GW_Armando_plate1_TTGW74 SA lud_Sathrundi lud_Sath 41.0
67 │ GW_Armando_plate1_TTGW78 GW_Armando_plate1_TTGW78 SA lud_Sathrundi lud_Sath 41.0
68 │ GW_Armando_plate1_TTGW_15_05 GW_Armando_plate1_TTGW_15_05 SR lud_Sural lud_central 45.0
69 │ GW_Armando_plate1_TTGW_15_07 GW_Armando_plate1_TTGW_15_07 SR lud_Sural lud_central 45.0
70 │ GW_Armando_plate1_TTGW_15_08 GW_Armando_plate1_TTGW_15_08 SR lud_Sural lud_central 45.0
71 │ GW_Armando_plate1_TTGW_15_09 GW_Armando_plate1_TTGW_15_09 SR lud_Sural lud_central 45.0
72 │ GW_Armando_plate1_UY1 GW_Armando_plate1_UY1 UY plumb plumb 87.0
73 │ GW_Armando_plate2_IL2 GW_Armando_plate2_IL2 IL_rep plumb_rep plumb_rep 84.0
74 │ GW_Armando_plate2_JE31G01 GW_Armando_plate2_JE31G01 VB_vi vir_misID vir 24.002
75 │ GW_Armando_plate2_JF03G01 GW_Armando_plate2_JF03G01 ST_vi vir_misID vir 24.003
76 │ GW_Armando_plate2_JF03G02 GW_Armando_plate2_JF03G02 VB_vi vir_misID vir 24.004
77 │ GW_Armando_plate2_JF07G01 GW_Armando_plate2_JF07G01 ST plumb plumb 128.0
78 │ GW_Armando_plate2_JF08G04 GW_Armando_plate2_JF08G04 ST plumb plumb 129.0
79 │ GW_Armando_plate2_JF10G02 GW_Armando_plate2_JF10G02 ST plumb plumb 130.0
80 │ GW_Armando_plate2_JF11G02 GW_Armando_plate2_JF11G02 ST plumb plumb 131.0
81 │ GW_Armando_plate2_JF12G03 GW_Armando_plate2_JF12G03 ST plumb plumb 132.0
82 │ GW_Armando_plate2_JF12G05 GW_Armando_plate2_JF12G05 ST plumb plumb 133.0
83 │ GW_Armando_plate2_JF13G02 GW_Armando_plate2_JF13G02 ST plumb plumb 134.0
84 │ GW_Armando_plate2_JF14G01 GW_Armando_plate2_JF14G01 DV plumb plumb 104.0
85 │ GW_Armando_plate2_JF14G02 GW_Armando_plate2_JF14G02 DV plumb plumb 105.0
86 │ GW_Armando_plate2_JF15G01 GW_Armando_plate2_JF15G01 DV plumb plumb 106.0
87 │ GW_Armando_plate2_JF15G02 GW_Armando_plate2_JF15G02 DV plumb plumb 107.0
88 │ GW_Armando_plate2_JF16G02 GW_Armando_plate2_JF16G02 DV_vi plumb_vir vir 24.042
89 │ GW_Armando_plate2_JF19G01 GW_Armando_plate2_JF19G01 MB plumb plumb 96.0
90 │ GW_Armando_plate2_JF20G02 GW_Armando_plate2_JF20G02 MB plumb plumb 97.0
91 │ GW_Armando_plate2_JF24G01 GW_Armando_plate2_JF24G01 VB plumb plumb 101.0
92 │ GW_Armando_plate2_JF24G03 GW_Armando_plate2_JF24G03 ST plumb plumb 135.0
93 │ GW_Armando_plate2_JF25G01 GW_Armando_plate2_JF25G01 VB plumb plumb 102.0
94 │ GW_Armando_plate2_JF26G02 GW_Armando_plate2_JF26G02 ST plumb plumb 136.0
95 │ GW_Armando_plate2_JF27G02 GW_Armando_plate2_JF27G02 ST plumb plumb 137.0
96 │ GW_Armando_plate2_JF30G01 GW_Armando_plate2_JF30G01 ST_vi vir_misID vir 24.005
97 │ GW_Armando_plate2_JG01G01 GW_Armando_plate2_JG01G01 PR plumb plumb 147.0
98 │ GW_Armando_plate2_JG02G01 GW_Armando_plate2_JG02G01 PR plumb plumb 148.0
99 │ GW_Armando_plate2_JG02G03 GW_Armando_plate2_JG02G03 PR plumb plumb 149.0
100 │ GW_Armando_plate2_JG10G02 GW_Armando_plate2_JG10G02 ST plumb plumb 138.0
101 │ GW_Armando_plate2_JG10G03 GW_Armando_plate2_JG10G03 ST plumb plumb 139.0
102 │ GW_Armando_plate2_JG12G02 GW_Armando_plate2_JG12G02 ST plumb plumb 140.0
103 │ GW_Armando_plate2_JG12G03 GW_Armando_plate2_JG12G03 ST plumb plumb 141.0
104 │ GW_Armando_plate2_LN11 GW_Armando_plate2_LN11 LN_rep troch_LN_rep troch_LN_rep 65.01
105 │ GW_Armando_plate2_LN2 GW_Armando_plate2_LN2 LN troch_LN troch_LN 58.01
106 │ GW_Armando_plate2_NO_BC_TTGW05 GW_Armando_plate2_NO_BC_TTGW05 blank blank blank -99.0
107 │ GW_Armando_plate2_NO_DNA GW_Armando_plate2_NO_DNA blank blank blank -99.0
108 │ GW_Armando_plate2_RF29G01 GW_Armando_plate2_RF29G01 BJ obs_plumb plumb_BJ 77.503
109 │ GW_Armando_plate2_TTGW02 GW_Armando_plate2_TTGW02 MN troch_MN troch_west 53.0
110 │ GW_Armando_plate2_TTGW03 GW_Armando_plate2_TTGW03 MN troch_MN troch_west 53.0
111 │ GW_Armando_plate2_TTGW05_rep3 GW_Armando_plate2_TTGW05_rep3 MN_rep troch_MN_rep troch_west_rep 53.0
112 │ GW_Armando_plate2_TTGW05_rep4 GW_Armando_plate2_TTGW05_rep4 MN_rep troch_MN_rep troch_west_rep 53.0
113 │ GW_Armando_plate2_TTGW08 GW_Armando_plate2_TTGW08 SU lud_Sukhto lud_central 47.0
114 │ GW_Armando_plate2_TTGW09 GW_Armando_plate2_TTGW09 SU lud_Sukhto lud_central 47.0
115 │ GW_Armando_plate2_TTGW12 GW_Armando_plate2_TTGW12 TH lud_Thallighar lud_central 43.0
116 │ GW_Armando_plate2_TTGW14 GW_Armando_plate2_TTGW14 TH lud_Thallighar lud_central 43.0
117 │ GW_Armando_plate2_TTGW15 GW_Armando_plate2_TTGW15 TH lud_Thallighar lud_central 43.0
118 │ GW_Armando_plate2_TTGW16 GW_Armando_plate2_TTGW16 TH lud_Thallighar lud_central 43.0
119 │ GW_Armando_plate2_TTGW18 GW_Armando_plate2_TTGW18 TH lud_Thallighar lud_central 43.0
120 │ GW_Armando_plate2_TTGW20 GW_Armando_plate2_TTGW20 SR lud_Sural lud_central 45.0
121 │ GW_Armando_plate2_TTGW24 GW_Armando_plate2_TTGW24 SR lud_Sural lud_central 45.0
122 │ GW_Armando_plate2_TTGW25 GW_Armando_plate2_TTGW25 SR lud_Sural lud_central 45.0
123 │ GW_Armando_plate2_TTGW27 GW_Armando_plate2_TTGW27 SR lud_Sural lud_central 45.0
124 │ GW_Armando_plate2_TTGW28 GW_Armando_plate2_TTGW28 SR lud_Sural lud_central 45.0
125 │ GW_Armando_plate2_TTGW50 GW_Armando_plate2_TTGW50 NG lud_Nainaghar lud_central 49.0
126 │ GW_Armando_plate2_TTGW51 GW_Armando_plate2_TTGW51 NG lud_Nainaghar lud_central 49.0
127 │ GW_Armando_plate2_TTGW54 GW_Armando_plate2_TTGW54 NG lud_Nainaghar lud_central 49.0
128 │ GW_Armando_plate2_TTGW56 GW_Armando_plate2_TTGW56 NG lud_Nainaghar lud_central 49.0
129 │ GW_Armando_plate2_TTGW60 GW_Armando_plate2_TTGW60 SP lud_Spiti troch_west 55.0
130 │ GW_Armando_plate2_TTGW61 GW_Armando_plate2_TTGW61 SP lud_Spiti troch_west 55.0
131 │ GW_Armando_plate2_TTGW62 GW_Armando_plate2_TTGW62 SP lud_Spiti troch_west 55.0
132 │ GW_Armando_plate2_TTGW67 GW_Armando_plate2_TTGW67 SP lud_Spiti troch_west 55.0
133 │ GW_Armando_plate2_TTGW69 GW_Armando_plate2_TTGW69 SP lud_Spiti troch_west 55.0
134 │ GW_Armando_plate2_TTGW73 GW_Armando_plate2_TTGW73 SA lud_Sathrundi lud_Sath 41.0
135 │ GW_Armando_plate2_TTGW75 GW_Armando_plate2_TTGW75 SA lud_Sathrundi lud_Sath 41.0
136 │ GW_Armando_plate2_TTGW77 GW_Armando_plate2_TTGW77 SA lud_Sathrundi lud_Sath 41.0
137 │ GW_Armando_plate2_TTGW79 GW_Armando_plate2_TTGW79 SA lud_Sathrundi lud_Sath 41.0
138 │ GW_Armando_plate2_TTGW80 GW_Armando_plate2_TTGW80 SA lud_Sathrundi lud_Sath 41.0
139 │ GW_Armando_plate2_TTGW_15_01 GW_Armando_plate2_TTGW_15_01 SR lud_Sural lud_central 45.0
140 │ GW_Armando_plate2_TTGW_15_02 GW_Armando_plate2_TTGW_15_02 SR lud_Sural lud_central 45.0
141 │ GW_Armando_plate2_TTGW_15_03 GW_Armando_plate2_TTGW_15_03 SR lud_Sural lud_central 45.0
142 │ GW_Armando_plate2_TTGW_15_04 GW_Armando_plate2_TTGW_15_04 SR lud_Sural lud_central 45.0
143 │ GW_Armando_plate2_TTGW_15_06 GW_Armando_plate2_TTGW_15_06 SR lud_Sural lud_central 45.0
144 │ GW_Armando_plate2_TTGW_15_10 GW_Armando_plate2_TTGW_15_10 SR lud_Sural lud_central 45.0
145 │ GW_Lane5_AA1 GW_Lane5_AA1 AA vir_S vir_S 25.0
146 │ GW_Lane5_AA10 GW_Lane5_AA10 AA vir_S vir_S 33.0
147 │ GW_Lane5_AA11 GW_Lane5_AA11 AA vir_S vir_S 34.0
148 │ GW_Lane5_AA3 GW_Lane5_AA3 AA vir_S vir_S 26.0
149 │ GW_Lane5_AA4 GW_Lane5_AA4 AA vir_S vir_S 27.0
150 │ GW_Lane5_AA5 GW_Lane5_AA5 AA vir_S vir_S 28.0
151 │ GW_Lane5_AA6 GW_Lane5_AA6 AA vir_S vir_S 29.0
152 │ GW_Lane5_AA7 GW_Lane5_AA7 AA vir_S vir_S 30.0
153 │ GW_Lane5_AA8 GW_Lane5_AA8 AA vir_S vir_S 31.0
154 │ GW_Lane5_AA9 GW_Lane5_AA9 AA vir_S vir_S 32.0
155 │ GW_Lane5_AB1 GW_Lane5_AB1 AB_rep vir_rep vir_rep 20.0
156 │ GW_Lane5_AB2 GW_Lane5_AB2 AB vir vir 21.0
157 │ GW_Lane5_AN1 GW_Lane5_AN1 AN plumb plumb 80.0
158 │ GW_Lane5_AN2 GW_Lane5_AN2 AN plumb plumb 81.0
159 │ GW_Lane5_BK2 GW_Lane5_BK2 BK plumb plumb 78.0
160 │ GW_Lane5_BK3 GW_Lane5_BK3 BK plumb plumb 79.0
161 │ GW_Lane5_DA2 GW_Lane5_DA2 XN obs obs 73.0
162 │ GW_Lane5_DA3 GW_Lane5_DA3 XN obs obs 74.0
163 │ GW_Lane5_DA4 GW_Lane5_DA4 XN obs obs 75.0
164 │ GW_Lane5_DA6 GW_Lane5_DA6 XN obs low_reads 76.0
165 │ GW_Lane5_DA7 GW_Lane5_DA7 XN obs obs 77.0
166 │ GW_Lane5_EM1 GW_Lane5_EM1 EM troch_EM troch_EM 72.0
167 │ GW_Lane5_IL1 GW_Lane5_IL1 IL plumb plumb 82.0
168 │ GW_Lane5_IL2 GW_Lane5_IL2 IL_rep plumb_rep plumb_rep 85.0
169 │ GW_Lane5_IL4 GW_Lane5_IL4 IL plumb plumb 83.0
170 │ GW_Lane5_KS1 GW_Lane5_KS1 OV lud_KS lud_KS 40.0
171 │ GW_Lane5_KS2 GW_Lane5_KS2 OV lud_KS lud_KS 40.0
172 │ GW_Lane5_LN1 GW_Lane5_LN1 LN troch_LN troch_LN 57.0
173 │ GW_Lane5_LN10 GW_Lane5_LN10 LN troch_LN troch_LN 64.0
174 │ GW_Lane5_LN11 GW_Lane5_LN11 LN troch_LN troch_LN 65.0
175 │ GW_Lane5_LN12 GW_Lane5_LN12 LN troch_LN troch_LN 66.0
176 │ GW_Lane5_LN14 GW_Lane5_LN14 LN troch_LN troch_LN 67.0
177 │ GW_Lane5_LN16 GW_Lane5_LN16 LN troch_LN troch_LN 68.0
178 │ GW_Lane5_LN18 GW_Lane5_LN18 LN troch_LN troch_LN 69.0
179 │ GW_Lane5_LN19 GW_Lane5_LN19 LN troch_LN troch_LN 70.0
180 │ GW_Lane5_LN2 GW_Lane5_LN2 LN_rep troch_LN_rep troch_LN_rep 58.0
181 │ GW_Lane5_LN20 GW_Lane5_LN20 LN troch_LN troch_LN 71.0
182 │ GW_Lane5_LN3 GW_Lane5_LN3 LN troch_LN troch_LN 59.0
183 │ GW_Lane5_LN4 GW_Lane5_LN4 LN troch_LN troch_LN 60.0
184 │ GW_Lane5_LN6 GW_Lane5_LN6 LN troch_LN troch_LN 61.0
185 │ GW_Lane5_LN7 GW_Lane5_LN7 LN troch_LN troch_LN 62.0
186 │ GW_Lane5_LN8 GW_Lane5_LN8 LN troch_LN troch_LN 63.0
187 │ GW_Lane5_MN1 GW_Lane5_MN1 MN troch_MN troch_west 51.0
188 │ GW_Lane5_MN12 GW_Lane5_MN12 MN troch_MN troch_west 56.0
189 │ GW_Lane5_MN3 GW_Lane5_MN3 MN troch_MN troch_west 52.0
190 │ GW_Lane5_MN5 GW_Lane5_MN5 MN troch_MN troch_west 53.0
191 │ GW_Lane5_MN8 GW_Lane5_MN8 MN troch_MN troch_west 54.0
192 │ GW_Lane5_MN9 GW_Lane5_MN9 MN troch_MN troch_west 55.0
193 │ GW_Lane5_NA1 GW_Lane5_NA1 NR lud_PK lud_PK 39.2
194 │ GW_Lane5_NA3-3ul GW_Lane5_NA3-3ul NR lud_PK lud_PK 39.2
195 │ GW_Lane5_PT11 GW_Lane5_PT11 KL lud_KL lud_central 42.0
196 │ GW_Lane5_PT12 GW_Lane5_PT12 KL lud_KL lud_central 42.0
197 │ GW_Lane5_PT2 GW_Lane5_PT2 ML lud_ML lud_ML 51.0
198 │ GW_Lane5_PT3 GW_Lane5_PT3 PA lud_PA lud_central 46.0
199 │ GW_Lane5_PT4 GW_Lane5_PT4 PA lud_PA lud_central 46.0
200 │ GW_Lane5_PT6 GW_Lane5_PT6 KL lud_KL lud_central 42.0
201 │ GW_Lane5_SH1 GW_Lane5_SH1 SH lud_PK lud_PK 39.1
202 │ GW_Lane5_SH2 GW_Lane5_SH2 SH lud_PK lud_PK 39.1
203 │ GW_Lane5_SH4 GW_Lane5_SH4 SH lud_PK lud_PK 39.1
204 │ GW_Lane5_SH5 GW_Lane5_SH5 SH lud_PK lud_PK 39.1
205 │ GW_Lane5_SL1 GW_Lane5_SL1 SL plumb plumb 150.0
206 │ GW_Lane5_SL2 GW_Lane5_SL2 SL plumb plumb 151.0
207 │ GW_Lane5_ST1 GW_Lane5_ST1 ST plumb plumb 142.0
208 │ GW_Lane5_ST12 GW_Lane5_ST12 ST plumb plumb 144.0
209 │ GW_Lane5_ST3 GW_Lane5_ST3 ST plumb plumb 143.0
210 │ GW_Lane5_STvi1 GW_Lane5_STvi1 ST_vi vir vir 22.0
211 │ GW_Lane5_STvi2 GW_Lane5_STvi2 ST_vi vir vir 23.0
212 │ GW_Lane5_STvi3 GW_Lane5_STvi3 ST_vi vir vir 24.0
213 │ GW_Lane5_TA1 GW_Lane5_TA1 TA plumb plumb 86.0
214 │ GW_Lane5_TL1 GW_Lane5_TL1 TL vir vir 9.0
215 │ GW_Lane5_TL10 GW_Lane5_TL10 TL vir vir 17.0
216 │ GW_Lane5_TL11 GW_Lane5_TL11 TL vir vir 18.0
217 │ GW_Lane5_TL12 GW_Lane5_TL12 TL vir vir 19.0
218 │ GW_Lane5_TL2 GW_Lane5_TL2 TL vir vir 10.0
219 │ GW_Lane5_TL3 GW_Lane5_TL3 TL_rep vir_rep vir_rep 11.0
220 │ GW_Lane5_TL4 GW_Lane5_TL4 TL vir vir 12.0
221 │ GW_Lane5_TL5 GW_Lane5_TL5 TL vir vir 13.0
222 │ GW_Lane5_TL7 GW_Lane5_TL7 TL vir vir 14.0
223 │ GW_Lane5_TL8 GW_Lane5_TL8 TL vir vir 15.0
224 │ GW_Lane5_TL9 GW_Lane5_TL9 TL vir vir 16.0
225 │ GW_Lane5_TU1 GW_Lane5_TU1 TU nit nit 35.0
226 │ GW_Lane5_TU2 GW_Lane5_TU2 TU nit nit 36.0
227 │ GW_Lane5_UY1 GW_Lane5_UY1 UY_rep plumb_rep plumb_rep 93.0
228 │ GW_Lane5_UY2 GW_Lane5_UY2 UY plumb plumb 88.0
229 │ GW_Lane5_UY3 GW_Lane5_UY3 UY plumb plumb 89.0
230 │ GW_Lane5_UY4 GW_Lane5_UY4 UY plumb plumb 90.0
231 │ GW_Lane5_UY5 GW_Lane5_UY5 UY plumb plumb 91.0
232 │ GW_Lane5_UY6 GW_Lane5_UY6 UY plumb plumb 92.0
233 │ GW_Lane5_YK1 GW_Lane5_YK1 YK vir vir 1.0
234 │ GW_Lane5_YK11 GW_Lane5_YK11 YK vir vir 8.0
235 │ GW_Lane5_YK3 GW_Lane5_YK3 YK vir vir 2.0
236 │ GW_Lane5_YK4 GW_Lane5_YK4 YK vir vir 3.0
237 │ GW_Lane5_YK5 GW_Lane5_YK5 YK vir vir 4.0
238 │ GW_Lane5_YK6 GW_Lane5_YK6 YK vir vir 5.0
239 │ GW_Lane5_YK7 GW_Lane5_YK7 YK vir vir 6.0
240 │ GW_Lane5_YK9 GW_Lane5_YK9 YK vir vir 7.0
241 │ GW_Liz_GBS_Liz10045 GW_Liz_GBS_Liz10045 ML lud lud_ML 51.01
242 │ GW_Liz_GBS_Liz10094 GW_Liz_GBS_Liz10094 ML lud lud_ML 51.02
243 │ GW_Liz_GBS_Liz5101 GW_Liz_GBS_Liz5101 ML lud lud_ML 51.03
244 │ GW_Liz_GBS_Liz5101_R GW_Liz_GBS_Liz5101_R ML_rep lud_rep lud_ML_rep 51.04
245 │ GW_Liz_GBS_Liz5118 GW_Liz_GBS_Liz5118 ML lud lud_ML 51.05
246 │ GW_Liz_GBS_Liz5139 GW_Liz_GBS_Liz5139 ML lud lud_ML 51.06
247 │ GW_Liz_GBS_Liz5142 GW_Liz_GBS_Liz5142 ML lud lud_ML 51.07
248 │ GW_Liz_GBS_Liz5144 GW_Liz_GBS_Liz5144 ML lud lud_ML 51.08
249 │ GW_Liz_GBS_Liz5150 GW_Liz_GBS_Liz5150 ML lud lud_ML 51.09
250 │ GW_Liz_GBS_Liz5159 GW_Liz_GBS_Liz5159 ML lud_chick lud_ML 51.1
251 │ GW_Liz_GBS_Liz5162 GW_Liz_GBS_Liz5162 ML lud_chick lud_ML 51.11
252 │ GW_Liz_GBS_Liz5163 GW_Liz_GBS_Liz5163 ML lud_chick lud_ML 51.12
253 │ GW_Liz_GBS_Liz5164 GW_Liz_GBS_Liz5164 ML lud_chick lud_ML 51.13
254 │ GW_Liz_GBS_Liz5165 GW_Liz_GBS_Liz5165 ML lud lud_ML 51.14
255 │ GW_Liz_GBS_Liz5167 GW_Liz_GBS_Liz5167 ML lud_chick lud_ML 51.15
256 │ GW_Liz_GBS_Liz5168 GW_Liz_GBS_Liz5168 ML lud_chick lud_ML 51.16
257 │ GW_Liz_GBS_Liz5169 GW_Liz_GBS_Liz5169 ML lud_chick lud_ML 51.17
258 │ GW_Liz_GBS_Liz5171 GW_Liz_GBS_Liz5171 ML lud lud_ML 51.18
259 │ GW_Liz_GBS_Liz5172 GW_Liz_GBS_Liz5172 ML lud_chick lud_ML 51.19
260 │ GW_Liz_GBS_Liz5173 GW_Liz_GBS_Liz5173 ML lud_chick lud_ML 51.2
261 │ GW_Liz_GBS_Liz5174 GW_Liz_GBS_Liz5174 ML lud lud_ML 51.21
262 │ GW_Liz_GBS_Liz5175 GW_Liz_GBS_Liz5175 ML lud lud_ML 51.22
263 │ GW_Liz_GBS_Liz5176 GW_Liz_GBS_Liz5176 ML lud lud_ML 51.23
264 │ GW_Liz_GBS_Liz5177 GW_Liz_GBS_Liz5177 ML lud_chick lud_ML 51.24
265 │ GW_Liz_GBS_Liz5178 GW_Liz_GBS_Liz5178 ML lud_chick lud_ML 51.25
266 │ GW_Liz_GBS_Liz5179 GW_Liz_GBS_Liz5179 ML lud_chick lud_ML 51.26
267 │ GW_Liz_GBS_Liz5180 GW_Liz_GBS_Liz5180 ML lud lud_ML 51.27
268 │ GW_Liz_GBS_Liz5182 GW_Liz_GBS_Liz5182 ML lud_chick lud_ML 51.28
269 │ GW_Liz_GBS_Liz5184 GW_Liz_GBS_Liz5184 ML lud_chick lud_ML 51.29
270 │ GW_Liz_GBS_Liz5185 GW_Liz_GBS_Liz5185 ML lud lud_ML 51.3
271 │ GW_Liz_GBS_Liz5186 GW_Liz_GBS_Liz5186 ML lud_chick lud_ML 51.31
272 │ GW_Liz_GBS_Liz5187 GW_Liz_GBS_Liz5187 ML lud_chick lud_ML 51.32
273 │ GW_Liz_GBS_Liz5188 GW_Liz_GBS_Liz5188 ML lud lud_ML 51.33
274 │ GW_Liz_GBS_Liz5189 GW_Liz_GBS_Liz5189 ML lud_chick lud_ML 51.34
275 │ GW_Liz_GBS_Liz5190 GW_Liz_GBS_Liz5190 ML lud_chick lud_ML 51.35
276 │ GW_Liz_GBS_Liz5191 GW_Liz_GBS_Liz5191 ML lud_chick lud_ML 51.36
277 │ GW_Liz_GBS_Liz5192 GW_Liz_GBS_Liz5192 ML lud_chick lud_ML 51.37
278 │ GW_Liz_GBS_Liz5193 GW_Liz_GBS_Liz5193 ML lud_chick lud_ML 51.38
279 │ GW_Liz_GBS_Liz5194 GW_Liz_GBS_Liz5194 ML lud_chick lud_ML 51.39
280 │ GW_Liz_GBS_Liz5195 GW_Liz_GBS_Liz5195 ML lud lud_ML 51.4
281 │ GW_Liz_GBS_Liz5197 GW_Liz_GBS_Liz5197 ML lud lud_ML 51.41
282 │ GW_Liz_GBS_Liz5199 GW_Liz_GBS_Liz5199 ML lud_chick lud_ML 51.42
283 │ GW_Liz_GBS_Liz6002 GW_Liz_GBS_Liz6002 ML lud lud_ML 51.43
284 │ GW_Liz_GBS_Liz6006 GW_Liz_GBS_Liz6006 ML lud lud_ML 51.44
285 │ GW_Liz_GBS_Liz6008 GW_Liz_GBS_Liz6008 ML lud lud_ML 51.45
286 │ GW_Liz_GBS_Liz6009 GW_Liz_GBS_Liz6009 ML lud lud_ML 51.46
287 │ GW_Liz_GBS_Liz6010 GW_Liz_GBS_Liz6010 ML lud lud_ML 51.47
288 │ GW_Liz_GBS_Liz6012 GW_Liz_GBS_Liz6012 ML lud lud_ML 51.48
289 │ GW_Liz_GBS_Liz6014 GW_Liz_GBS_Liz6014 ML lud lud_ML 51.49
290 │ GW_Liz_GBS_Liz6055 GW_Liz_GBS_Liz6055 ML lud lud_ML 51.5
291 │ GW_Liz_GBS_Liz6057 GW_Liz_GBS_Liz6057 ML lud lud_ML 51.51
292 │ GW_Liz_GBS_Liz6060 GW_Liz_GBS_Liz6060 ML lud lud_ML 51.52
293 │ GW_Liz_GBS_Liz6062 GW_Liz_GBS_Liz6062 ML lud lud_ML 51.53
294 │ GW_Liz_GBS_Liz6063 GW_Liz_GBS_Liz6063 ML lud lud_ML 51.54
295 │ GW_Liz_GBS_Liz6066 GW_Liz_GBS_Liz6066 ML lud lud_ML 51.55
296 │ GW_Liz_GBS_Liz6072 GW_Liz_GBS_Liz6072 ML lud lud_ML 51.56
297 │ GW_Liz_GBS_Liz6079 GW_Liz_GBS_Liz6079 ML lud lud_ML 51.57
298 │ GW_Liz_GBS_Liz6203 GW_Liz_GBS_Liz6203 ML lud_chick lud_ML 51.58
299 │ GW_Liz_GBS_Liz6204 GW_Liz_GBS_Liz6204 ML lud_chick lud_ML 51.59
300 │ GW_Liz_GBS_Liz6461 GW_Liz_GBS_Liz6461 ML lud lud_ML 51.6
301 │ GW_Liz_GBS_Liz6472 GW_Liz_GBS_Liz6472 ML lud lud_ML 51.61
302 │ GW_Liz_GBS_Liz6478 GW_Liz_GBS_Liz6478 ML lud lud_ML 51.62
303 │ GW_Liz_GBS_Liz6766 GW_Liz_GBS_Liz6766 ML lud lud_ML 51.63
304 │ GW_Liz_GBS_Liz6776 GW_Liz_GBS_Liz6776 ML lud lud_ML 51.64
305 │ GW_Liz_GBS_Liz6794 GW_Liz_GBS_Liz6794 ML lud lud_ML 51.65
306 │ GW_Liz_GBS_P_fusc GW_Liz_GBS_P_fusc fusc fusc fusc 201.0
307 │ GW_Liz_GBS_P_h_man GW_Liz_GBS_P_h_man hmand hmand hmand 202.0
308 │ GW_Liz_GBS_P_humei GW_Liz_GBS_P_humei hume hume hume 203.0
309 │ GW_Liz_GBS_P_inor GW_Liz_GBS_P_inor inor inor inor 204.0
310 │ GW_Liz_GBS_S_burk GW_Liz_GBS_S_burk burk burk burk 205.0
GOOD NEWS: names of individuals in metadata file and genotype ind file match perfectly.
Filtering
Filter out duplicate runs (indicated with _rep in Fst_group column)
selection = occursin.("_rep", ind_with_metadata.Fst_group)
println("""Filtering out these runs because they are duplicates of another,
according to having "rep" in Fst_group: """)
display(ind_with_metadata.ind[selection])
ind_with_metadata_indFiltered = ind_with_metadata[Not(selection), :];
geno_indFiltered = view(geno, Not(selection), :); # use of view() avoids copying large memory objectFiltering out these runs because they are duplicates of another,
according to having "rep" in Fst_group:
11-element Vector{String}:
"GW_Armando_plate1_TTGW05_rep1"
"GW_Armando_plate2_IL2"
"GW_Armando_plate2_LN11"
"GW_Armando_plate2_TTGW05_rep3"
"GW_Armando_plate2_TTGW05_rep4"
"GW_Lane5_AB1"
"GW_Lane5_IL2"
"GW_Lane5_LN2"
"GW_Lane5_TL3"
"GW_Lane5_UY1"
"GW_Liz_GBS_Liz5101_R"
Filter specific individuals
If there are certain individuals that we want to filter out prior to any additional analysis, we can do so here by setting filter to true and specifying the individual row numbers in filter_out_inds:
filter = true
# Specify individuals to filter out:
filter_out_inds = ["GW_Liz_GBS_P_fusc", "GW_Liz_GBS_P_h_man", "GW_Liz_GBS_P_humei", "GW_Liz_GBS_P_inor", "GW_Liz_GBS_S_burk"]
if filter
selection = map(in(filter_out_inds), ind_with_metadata_indFiltered.ind)
filtered_out = ind_with_metadata_indFiltered.ind[selection]
ind_with_metadata_indFiltered = ind_with_metadata_indFiltered[Not(selection), :]
geno_indFiltered = view(geno_indFiltered, Not(selection), :)
println("Specific individuals filtered out as requested: ")
display(filtered_out)
else
println("No specific individuals filtered (because filter not true)")
endSpecific individuals filtered out as requested:
5-element Vector{String}:
"GW_Liz_GBS_P_fusc"
"GW_Liz_GBS_P_h_man"
"GW_Liz_GBS_P_humei"
"GW_Liz_GBS_P_inor"
"GW_Liz_GBS_S_burk"
Filter individuals based on missing genotypes
Here we determine number of missing SNPs per individual (40% for this round), and filter out those individual datasets with more than a certain percent of missing SNPs:
SNPmissing_percent_allowed_per_ind = 40 # this is the percentage threshold
threshold_missing = loci_count * SNPmissing_percent_allowed_per_ind/100
numMissings = sum(geno_indFiltered .== -1, dims=2)
ind_with_metadata_indFiltered.numMissings .= numMissings
selection = vec(numMissings .<= threshold_missing) # the vec command converts to BitVector rather than BitMatrix--important below
println("Filtering out these individuals based on too many missing genotypes: ")
filtered_inds = ind_with_metadata_indFiltered.ind[selection.==false]
println(DataFrame(filtered_inds = filtered_inds)) # did this to print all lines
ind_with_metadata_indFiltered = ind_with_metadata_indFiltered[selection, :]
geno_indFiltered = view(geno_indFiltered, selection, :);
println()
println("Here are the remaining individuals: ")
println(DataFrame(ind_with_metadata_indFiltered))Filtering out these individuals based on too many missing genotypes:
33×1 DataFrame
Row │ filtered_inds
│ String
─────┼────────────────────────────────
1 │ GW_Armando_plate1_JG08G02
2 │ GW_Armando_plate1_JG10G01
3 │ GW_Armando_plate1_NO_BC_TTGW05
4 │ GW_Armando_plate1_NO_DNA
5 │ GW_Armando_plate1_TTGW21
6 │ GW_Armando_plate1_TTGW71
7 │ GW_Armando_plate2_NO_BC_TTGW05
8 │ GW_Armando_plate2_NO_DNA
9 │ GW_Armando_plate2_TTGW15
10 │ GW_Lane5_AA10
11 │ GW_Lane5_DA6
12 │ GW_Lane5_LN11
13 │ GW_Liz_GBS_Liz5101
14 │ GW_Liz_GBS_Liz5118
15 │ GW_Liz_GBS_Liz5139
16 │ GW_Liz_GBS_Liz5142
17 │ GW_Liz_GBS_Liz5150
18 │ GW_Liz_GBS_Liz5159
19 │ GW_Liz_GBS_Liz5162
20 │ GW_Liz_GBS_Liz5169
21 │ GW_Liz_GBS_Liz5171
22 │ GW_Liz_GBS_Liz5172
23 │ GW_Liz_GBS_Liz5174
24 │ GW_Liz_GBS_Liz5176
25 │ GW_Liz_GBS_Liz5177
26 │ GW_Liz_GBS_Liz5180
27 │ GW_Liz_GBS_Liz5186
28 │ GW_Liz_GBS_Liz5187
29 │ GW_Liz_GBS_Liz5192
30 │ GW_Liz_GBS_Liz5195
31 │ GW_Liz_GBS_Liz6012
32 │ GW_Liz_GBS_Liz6203
33 │ GW_Liz_GBS_Liz6766
Here are the remaining individuals:
261×7 DataFrame
Row │ ind ID location group Fst_group plot_order numMissings
│ String String31 String7 String15 String15 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ GW_Armando_plate1_AB1 GW_Armando_plate1_AB1 AB vir vir 20.01 99270
2 │ GW_Armando_plate1_JF07G02 GW_Armando_plate1_JF07G02 ST plumb plumb 108.0 181208
3 │ GW_Armando_plate1_JF07G03 GW_Armando_plate1_JF07G03 ST plumb plumb 109.0 116334
4 │ GW_Armando_plate1_JF07G04 GW_Armando_plate1_JF07G04 ST plumb plumb 110.0 116065
5 │ GW_Armando_plate1_JF08G02 GW_Armando_plate1_JF08G02 ST plumb plumb 111.0 87194
6 │ GW_Armando_plate1_JF09G01 GW_Armando_plate1_JF09G01 ST plumb plumb 112.0 237912
7 │ GW_Armando_plate1_JF09G02 GW_Armando_plate1_JF09G02 ST plumb plumb 113.0 110439
8 │ GW_Armando_plate1_JF10G03 GW_Armando_plate1_JF10G03 ST plumb_vir plumb_vir 170.0 202891
9 │ GW_Armando_plate1_JF11G01 GW_Armando_plate1_JF11G01 ST plumb plumb 114.0 112391
10 │ GW_Armando_plate1_JF12G01 GW_Armando_plate1_JF12G01 ST plumb plumb 115.0 129490
11 │ GW_Armando_plate1_JF12G02 GW_Armando_plate1_JF12G02 ST plumb plumb 116.0 130718
12 │ GW_Armando_plate1_JF12G04 GW_Armando_plate1_JF12G04 ST_vi vir vir 24.001 133969
13 │ GW_Armando_plate1_JF13G01 GW_Armando_plate1_JF13G01 ST plumb plumb 117.0 107508
14 │ GW_Armando_plate1_JF15G03 GW_Armando_plate1_JF15G03 DV plumb plumb 103.0 110933
15 │ GW_Armando_plate1_JF16G01 GW_Armando_plate1_JF16G01 DV_vi plumb_vir vir 24.041 122599
16 │ GW_Armando_plate1_JF20G01 GW_Armando_plate1_JF20G01 MB plumb plumb 94.0 102740
17 │ GW_Armando_plate1_JF22G01 GW_Armando_plate1_JF22G01 MB plumb plumb 95.0 99501
18 │ GW_Armando_plate1_JF23G01 GW_Armando_plate1_JF23G01 VB plumb plumb 98.0 113052
19 │ GW_Armando_plate1_JF23G02 GW_Armando_plate1_JF23G02 VB plumb plumb 99.0 87860
20 │ GW_Armando_plate1_JF24G02 GW_Armando_plate1_JF24G02 VB plumb plumb 100.0 88105
21 │ GW_Armando_plate1_JF26G01 GW_Armando_plate1_JF26G01 ST plumb plumb 118.0 104432
22 │ GW_Armando_plate1_JF27G01 GW_Armando_plate1_JF27G01 ST plumb plumb 119.0 96001
23 │ GW_Armando_plate1_JF29G01 GW_Armando_plate1_JF29G01 ST plumb plumb 120.0 86530
24 │ GW_Armando_plate1_JF29G02 GW_Armando_plate1_JF29G02 ST plumb plumb 121.0 101762
25 │ GW_Armando_plate1_JF29G03 GW_Armando_plate1_JF29G03 ST plumb plumb 122.0 107333
26 │ GW_Armando_plate1_JG02G02 GW_Armando_plate1_JG02G02 PR plumb plumb 145.0 81977
27 │ GW_Armando_plate1_JG02G04 GW_Armando_plate1_JG02G04 PR plumb plumb 146.0 140445
28 │ GW_Armando_plate1_JG08G01 GW_Armando_plate1_JG08G01 ST plumb plumb 123.0 131448
29 │ GW_Armando_plate1_JG12G01 GW_Armando_plate1_JG12G01 ST plumb plumb 126.0 132475
30 │ GW_Armando_plate1_JG17G01 GW_Armando_plate1_JG17G01 ST plumb_vir plumb 127.0 106045
31 │ GW_Armando_plate1_RF20G01 GW_Armando_plate1_RF20G01 BJ obs_plumb plumb_BJ 77.501 102115
32 │ GW_Armando_plate1_RF29G02 GW_Armando_plate1_RF29G02 BJ obs_plumb plumb_BJ 77.502 99695
33 │ GW_Armando_plate1_TL3 GW_Armando_plate1_TL3 TL vir vir 11.01 119043
34 │ GW_Armando_plate1_TTGW01 GW_Armando_plate1_TTGW01 MN troch_MN troch_west 53.0 80329
35 │ GW_Armando_plate1_TTGW05_rep2 GW_Armando_plate1_TTGW05_rep2 MN troch_MN troch_west 53.0 95518
36 │ GW_Armando_plate1_TTGW06 GW_Armando_plate1_TTGW06 SU lud_Sukhto lud_central 47.0 80282
37 │ GW_Armando_plate1_TTGW07 GW_Armando_plate1_TTGW07 SU lud_Sukhto lud_central 47.0 153267
38 │ GW_Armando_plate1_TTGW10 GW_Armando_plate1_TTGW10 SU lud_Sukhto lud_central 47.0 97805
39 │ GW_Armando_plate1_TTGW11 GW_Armando_plate1_TTGW11 SU lud_Sukhto lud_central 47.0 93712
40 │ GW_Armando_plate1_TTGW13 GW_Armando_plate1_TTGW13 TH lud_Thallighar lud_central 43.0 87677
41 │ GW_Armando_plate1_TTGW17 GW_Armando_plate1_TTGW17 TH lud_Thallighar lud_central 43.0 100310
42 │ GW_Armando_plate1_TTGW19 GW_Armando_plate1_TTGW19 TH lud_Thallighar lud_central 43.0 100019
43 │ GW_Armando_plate1_TTGW22 GW_Armando_plate1_TTGW22 SR lud_Sural lud_central 45.0 95151
44 │ GW_Armando_plate1_TTGW23 GW_Armando_plate1_TTGW23 SR lud_Sural lud_central 45.0 218168
45 │ GW_Armando_plate1_TTGW29 GW_Armando_plate1_TTGW29 SR lud_Sural lud_central 45.0 101068
46 │ GW_Armando_plate1_TTGW52 GW_Armando_plate1_TTGW52 NG lud_Nainaghar lud_central 49.0 89055
47 │ GW_Armando_plate1_TTGW53 GW_Armando_plate1_TTGW53 NG lud_Nainaghar lud_central 49.0 76154
48 │ GW_Armando_plate1_TTGW55 GW_Armando_plate1_TTGW55 NG lud_Nainaghar lud_central 49.0 84162
49 │ GW_Armando_plate1_TTGW57 GW_Armando_plate1_TTGW57 NG lud_Nainaghar lud_central 49.0 93413
50 │ GW_Armando_plate1_TTGW58 GW_Armando_plate1_TTGW58 NG lud_Nainaghar lud_central 49.0 102386
51 │ GW_Armando_plate1_TTGW59 GW_Armando_plate1_TTGW59 NG lud_Nainaghar lud_central 49.0 98471
52 │ GW_Armando_plate1_TTGW63 GW_Armando_plate1_TTGW63 SP lud_Spiti troch_west 55.0 98425
53 │ GW_Armando_plate1_TTGW64 GW_Armando_plate1_TTGW64 SP lud_Spiti troch_west 55.0 111234
54 │ GW_Armando_plate1_TTGW65 GW_Armando_plate1_TTGW65 SP lud_Spiti troch_west 55.0 119729
55 │ GW_Armando_plate1_TTGW66 GW_Armando_plate1_TTGW66 SP lud_Spiti troch_west 55.0 105443
56 │ GW_Armando_plate1_TTGW68 GW_Armando_plate1_TTGW68 SP lud_Spiti troch_west 55.0 98099
57 │ GW_Armando_plate1_TTGW70 GW_Armando_plate1_TTGW70 SA lud_Sathrundi lud_Sath 41.0 84930
58 │ GW_Armando_plate1_TTGW72 GW_Armando_plate1_TTGW72 SA lud_Sathrundi lud_Sath 41.0 92370
59 │ GW_Armando_plate1_TTGW74 GW_Armando_plate1_TTGW74 SA lud_Sathrundi lud_Sath 41.0 325539
60 │ GW_Armando_plate1_TTGW78 GW_Armando_plate1_TTGW78 SA lud_Sathrundi lud_Sath 41.0 96540
61 │ GW_Armando_plate1_TTGW_15_05 GW_Armando_plate1_TTGW_15_05 SR lud_Sural lud_central 45.0 173819
62 │ GW_Armando_plate1_TTGW_15_07 GW_Armando_plate1_TTGW_15_07 SR lud_Sural lud_central 45.0 84861
63 │ GW_Armando_plate1_TTGW_15_08 GW_Armando_plate1_TTGW_15_08 SR lud_Sural lud_central 45.0 88322
64 │ GW_Armando_plate1_TTGW_15_09 GW_Armando_plate1_TTGW_15_09 SR lud_Sural lud_central 45.0 87883
65 │ GW_Armando_plate1_UY1 GW_Armando_plate1_UY1 UY plumb plumb 87.0 97598
66 │ GW_Armando_plate2_JE31G01 GW_Armando_plate2_JE31G01 VB_vi vir_misID vir 24.002 101749
67 │ GW_Armando_plate2_JF03G01 GW_Armando_plate2_JF03G01 ST_vi vir_misID vir 24.003 214510
68 │ GW_Armando_plate2_JF03G02 GW_Armando_plate2_JF03G02 VB_vi vir_misID vir 24.004 296518
69 │ GW_Armando_plate2_JF07G01 GW_Armando_plate2_JF07G01 ST plumb plumb 128.0 104578
70 │ GW_Armando_plate2_JF08G04 GW_Armando_plate2_JF08G04 ST plumb plumb 129.0 101010
71 │ GW_Armando_plate2_JF10G02 GW_Armando_plate2_JF10G02 ST plumb plumb 130.0 95393
72 │ GW_Armando_plate2_JF11G02 GW_Armando_plate2_JF11G02 ST plumb plumb 131.0 89277
73 │ GW_Armando_plate2_JF12G03 GW_Armando_plate2_JF12G03 ST plumb plumb 132.0 118134
74 │ GW_Armando_plate2_JF12G05 GW_Armando_plate2_JF12G05 ST plumb plumb 133.0 85768
75 │ GW_Armando_plate2_JF13G02 GW_Armando_plate2_JF13G02 ST plumb plumb 134.0 113146
76 │ GW_Armando_plate2_JF14G01 GW_Armando_plate2_JF14G01 DV plumb plumb 104.0 97185
77 │ GW_Armando_plate2_JF14G02 GW_Armando_plate2_JF14G02 DV plumb plumb 105.0 94078
78 │ GW_Armando_plate2_JF15G01 GW_Armando_plate2_JF15G01 DV plumb plumb 106.0 89126
79 │ GW_Armando_plate2_JF15G02 GW_Armando_plate2_JF15G02 DV plumb plumb 107.0 101787
80 │ GW_Armando_plate2_JF16G02 GW_Armando_plate2_JF16G02 DV_vi plumb_vir vir 24.042 125147
81 │ GW_Armando_plate2_JF19G01 GW_Armando_plate2_JF19G01 MB plumb plumb 96.0 94354
82 │ GW_Armando_plate2_JF20G02 GW_Armando_plate2_JF20G02 MB plumb plumb 97.0 78032
83 │ GW_Armando_plate2_JF24G01 GW_Armando_plate2_JF24G01 VB plumb plumb 101.0 113526
84 │ GW_Armando_plate2_JF24G03 GW_Armando_plate2_JF24G03 ST plumb plumb 135.0 77983
85 │ GW_Armando_plate2_JF25G01 GW_Armando_plate2_JF25G01 VB plumb plumb 102.0 86955
86 │ GW_Armando_plate2_JF26G02 GW_Armando_plate2_JF26G02 ST plumb plumb 136.0 79995
87 │ GW_Armando_plate2_JF27G02 GW_Armando_plate2_JF27G02 ST plumb plumb 137.0 97159
88 │ GW_Armando_plate2_JF30G01 GW_Armando_plate2_JF30G01 ST_vi vir_misID vir 24.005 117010
89 │ GW_Armando_plate2_JG01G01 GW_Armando_plate2_JG01G01 PR plumb plumb 147.0 97961
90 │ GW_Armando_plate2_JG02G01 GW_Armando_plate2_JG02G01 PR plumb plumb 148.0 81494
91 │ GW_Armando_plate2_JG02G03 GW_Armando_plate2_JG02G03 PR plumb plumb 149.0 96990
92 │ GW_Armando_plate2_JG10G02 GW_Armando_plate2_JG10G02 ST plumb plumb 138.0 78807
93 │ GW_Armando_plate2_JG10G03 GW_Armando_plate2_JG10G03 ST plumb plumb 139.0 91258
94 │ GW_Armando_plate2_JG12G02 GW_Armando_plate2_JG12G02 ST plumb plumb 140.0 102563
95 │ GW_Armando_plate2_JG12G03 GW_Armando_plate2_JG12G03 ST plumb plumb 141.0 91523
96 │ GW_Armando_plate2_LN2 GW_Armando_plate2_LN2 LN troch_LN troch_LN 58.01 77555
97 │ GW_Armando_plate2_RF29G01 GW_Armando_plate2_RF29G01 BJ obs_plumb plumb_BJ 77.503 94504
98 │ GW_Armando_plate2_TTGW02 GW_Armando_plate2_TTGW02 MN troch_MN troch_west 53.0 82355
99 │ GW_Armando_plate2_TTGW03 GW_Armando_plate2_TTGW03 MN troch_MN troch_west 53.0 93915
100 │ GW_Armando_plate2_TTGW08 GW_Armando_plate2_TTGW08 SU lud_Sukhto lud_central 47.0 79030
101 │ GW_Armando_plate2_TTGW09 GW_Armando_plate2_TTGW09 SU lud_Sukhto lud_central 47.0 124074
102 │ GW_Armando_plate2_TTGW12 GW_Armando_plate2_TTGW12 TH lud_Thallighar lud_central 43.0 80774
103 │ GW_Armando_plate2_TTGW14 GW_Armando_plate2_TTGW14 TH lud_Thallighar lud_central 43.0 91726
104 │ GW_Armando_plate2_TTGW16 GW_Armando_plate2_TTGW16 TH lud_Thallighar lud_central 43.0 78610
105 │ GW_Armando_plate2_TTGW18 GW_Armando_plate2_TTGW18 TH lud_Thallighar lud_central 43.0 90831
106 │ GW_Armando_plate2_TTGW20 GW_Armando_plate2_TTGW20 SR lud_Sural lud_central 45.0 70473
107 │ GW_Armando_plate2_TTGW24 GW_Armando_plate2_TTGW24 SR lud_Sural lud_central 45.0 88081
108 │ GW_Armando_plate2_TTGW25 GW_Armando_plate2_TTGW25 SR lud_Sural lud_central 45.0 113545
109 │ GW_Armando_plate2_TTGW27 GW_Armando_plate2_TTGW27 SR lud_Sural lud_central 45.0 92690
110 │ GW_Armando_plate2_TTGW28 GW_Armando_plate2_TTGW28 SR lud_Sural lud_central 45.0 90248
111 │ GW_Armando_plate2_TTGW50 GW_Armando_plate2_TTGW50 NG lud_Nainaghar lud_central 49.0 73271
112 │ GW_Armando_plate2_TTGW51 GW_Armando_plate2_TTGW51 NG lud_Nainaghar lud_central 49.0 82100
113 │ GW_Armando_plate2_TTGW54 GW_Armando_plate2_TTGW54 NG lud_Nainaghar lud_central 49.0 680557
114 │ GW_Armando_plate2_TTGW56 GW_Armando_plate2_TTGW56 NG lud_Nainaghar lud_central 49.0 77725
115 │ GW_Armando_plate2_TTGW60 GW_Armando_plate2_TTGW60 SP lud_Spiti troch_west 55.0 80730
116 │ GW_Armando_plate2_TTGW61 GW_Armando_plate2_TTGW61 SP lud_Spiti troch_west 55.0 96107
117 │ GW_Armando_plate2_TTGW62 GW_Armando_plate2_TTGW62 SP lud_Spiti troch_west 55.0 101894
118 │ GW_Armando_plate2_TTGW67 GW_Armando_plate2_TTGW67 SP lud_Spiti troch_west 55.0 145905
119 │ GW_Armando_plate2_TTGW69 GW_Armando_plate2_TTGW69 SP lud_Spiti troch_west 55.0 91534
120 │ GW_Armando_plate2_TTGW73 GW_Armando_plate2_TTGW73 SA lud_Sathrundi lud_Sath 41.0 72217
121 │ GW_Armando_plate2_TTGW75 GW_Armando_plate2_TTGW75 SA lud_Sathrundi lud_Sath 41.0 92128
122 │ GW_Armando_plate2_TTGW77 GW_Armando_plate2_TTGW77 SA lud_Sathrundi lud_Sath 41.0 75475
123 │ GW_Armando_plate2_TTGW79 GW_Armando_plate2_TTGW79 SA lud_Sathrundi lud_Sath 41.0 96324
124 │ GW_Armando_plate2_TTGW80 GW_Armando_plate2_TTGW80 SA lud_Sathrundi lud_Sath 41.0 75357
125 │ GW_Armando_plate2_TTGW_15_01 GW_Armando_plate2_TTGW_15_01 SR lud_Sural lud_central 45.0 91072
126 │ GW_Armando_plate2_TTGW_15_02 GW_Armando_plate2_TTGW_15_02 SR lud_Sural lud_central 45.0 72691
127 │ GW_Armando_plate2_TTGW_15_03 GW_Armando_plate2_TTGW_15_03 SR lud_Sural lud_central 45.0 74750
128 │ GW_Armando_plate2_TTGW_15_04 GW_Armando_plate2_TTGW_15_04 SR lud_Sural lud_central 45.0 193531
129 │ GW_Armando_plate2_TTGW_15_06 GW_Armando_plate2_TTGW_15_06 SR lud_Sural lud_central 45.0 76538
130 │ GW_Armando_plate2_TTGW_15_10 GW_Armando_plate2_TTGW_15_10 SR lud_Sural lud_central 45.0 219999
131 │ GW_Lane5_AA1 GW_Lane5_AA1 AA vir_S vir_S 25.0 717676
132 │ GW_Lane5_AA11 GW_Lane5_AA11 AA vir_S vir_S 34.0 736085
133 │ GW_Lane5_AA3 GW_Lane5_AA3 AA vir_S vir_S 26.0 662552
134 │ GW_Lane5_AA4 GW_Lane5_AA4 AA vir_S vir_S 27.0 665969
135 │ GW_Lane5_AA5 GW_Lane5_AA5 AA vir_S vir_S 28.0 718610
136 │ GW_Lane5_AA6 GW_Lane5_AA6 AA vir_S vir_S 29.0 778821
137 │ GW_Lane5_AA7 GW_Lane5_AA7 AA vir_S vir_S 30.0 724717
138 │ GW_Lane5_AA8 GW_Lane5_AA8 AA vir_S vir_S 31.0 922313
139 │ GW_Lane5_AA9 GW_Lane5_AA9 AA vir_S vir_S 32.0 791514
140 │ GW_Lane5_AB2 GW_Lane5_AB2 AB vir vir 21.0 915622
141 │ GW_Lane5_AN1 GW_Lane5_AN1 AN plumb plumb 80.0 719041
142 │ GW_Lane5_AN2 GW_Lane5_AN2 AN plumb plumb 81.0 684348
143 │ GW_Lane5_BK2 GW_Lane5_BK2 BK plumb plumb 78.0 666018
144 │ GW_Lane5_BK3 GW_Lane5_BK3 BK plumb plumb 79.0 665090
145 │ GW_Lane5_DA2 GW_Lane5_DA2 XN obs obs 73.0 655980
146 │ GW_Lane5_DA3 GW_Lane5_DA3 XN obs obs 74.0 696722
147 │ GW_Lane5_DA4 GW_Lane5_DA4 XN obs obs 75.0 634940
148 │ GW_Lane5_DA7 GW_Lane5_DA7 XN obs obs 77.0 733664
149 │ GW_Lane5_EM1 GW_Lane5_EM1 EM troch_EM troch_EM 72.0 711636
150 │ GW_Lane5_IL1 GW_Lane5_IL1 IL plumb plumb 82.0 698039
151 │ GW_Lane5_IL4 GW_Lane5_IL4 IL plumb plumb 83.0 726050
152 │ GW_Lane5_KS1 GW_Lane5_KS1 OV lud_KS lud_KS 40.0 751642
153 │ GW_Lane5_KS2 GW_Lane5_KS2 OV lud_KS lud_KS 40.0 820241
154 │ GW_Lane5_LN1 GW_Lane5_LN1 LN troch_LN troch_LN 57.0 707607
155 │ GW_Lane5_LN10 GW_Lane5_LN10 LN troch_LN troch_LN 64.0 783282
156 │ GW_Lane5_LN12 GW_Lane5_LN12 LN troch_LN troch_LN 66.0 884653
157 │ GW_Lane5_LN14 GW_Lane5_LN14 LN troch_LN troch_LN 67.0 753482
158 │ GW_Lane5_LN16 GW_Lane5_LN16 LN troch_LN troch_LN 68.0 738272
159 │ GW_Lane5_LN18 GW_Lane5_LN18 LN troch_LN troch_LN 69.0 711612
160 │ GW_Lane5_LN19 GW_Lane5_LN19 LN troch_LN troch_LN 70.0 731751
161 │ GW_Lane5_LN20 GW_Lane5_LN20 LN troch_LN troch_LN 71.0 738170
162 │ GW_Lane5_LN3 GW_Lane5_LN3 LN troch_LN troch_LN 59.0 702316
163 │ GW_Lane5_LN4 GW_Lane5_LN4 LN troch_LN troch_LN 60.0 683211
164 │ GW_Lane5_LN6 GW_Lane5_LN6 LN troch_LN troch_LN 61.0 690808
165 │ GW_Lane5_LN7 GW_Lane5_LN7 LN troch_LN troch_LN 62.0 758021
166 │ GW_Lane5_LN8 GW_Lane5_LN8 LN troch_LN troch_LN 63.0 662407
167 │ GW_Lane5_MN1 GW_Lane5_MN1 MN troch_MN troch_west 51.0 943574
168 │ GW_Lane5_MN12 GW_Lane5_MN12 MN troch_MN troch_west 56.0 671074
169 │ GW_Lane5_MN3 GW_Lane5_MN3 MN troch_MN troch_west 52.0 946478
170 │ GW_Lane5_MN5 GW_Lane5_MN5 MN troch_MN troch_west 53.0 752979
171 │ GW_Lane5_MN8 GW_Lane5_MN8 MN troch_MN troch_west 54.0 771625
172 │ GW_Lane5_MN9 GW_Lane5_MN9 MN troch_MN troch_west 55.0 897944
173 │ GW_Lane5_NA1 GW_Lane5_NA1 NR lud_PK lud_PK 39.2 919218
174 │ GW_Lane5_NA3-3ul GW_Lane5_NA3-3ul NR lud_PK lud_PK 39.2 798370
175 │ GW_Lane5_PT11 GW_Lane5_PT11 KL lud_KL lud_central 42.0 771797
176 │ GW_Lane5_PT12 GW_Lane5_PT12 KL lud_KL lud_central 42.0 792683
177 │ GW_Lane5_PT2 GW_Lane5_PT2 ML lud_ML lud_ML 51.0 760347
178 │ GW_Lane5_PT3 GW_Lane5_PT3 PA lud_PA lud_central 46.0 722177
179 │ GW_Lane5_PT4 GW_Lane5_PT4 PA lud_PA lud_central 46.0 705413
180 │ GW_Lane5_PT6 GW_Lane5_PT6 KL lud_KL lud_central 42.0 763723
181 │ GW_Lane5_SH1 GW_Lane5_SH1 SH lud_PK lud_PK 39.1 966761
182 │ GW_Lane5_SH2 GW_Lane5_SH2 SH lud_PK lud_PK 39.1 768646
183 │ GW_Lane5_SH4 GW_Lane5_SH4 SH lud_PK lud_PK 39.1 849641
184 │ GW_Lane5_SH5 GW_Lane5_SH5 SH lud_PK lud_PK 39.1 929394
185 │ GW_Lane5_SL1 GW_Lane5_SL1 SL plumb plumb 150.0 648883
186 │ GW_Lane5_SL2 GW_Lane5_SL2 SL plumb plumb 151.0 654733
187 │ GW_Lane5_ST1 GW_Lane5_ST1 ST plumb plumb 142.0 606242
188 │ GW_Lane5_ST12 GW_Lane5_ST12 ST plumb plumb 144.0 691266
189 │ GW_Lane5_ST3 GW_Lane5_ST3 ST plumb plumb 143.0 696992
190 │ GW_Lane5_STvi1 GW_Lane5_STvi1 ST_vi vir vir 22.0 690090
191 │ GW_Lane5_STvi2 GW_Lane5_STvi2 ST_vi vir vir 23.0 897337
192 │ GW_Lane5_STvi3 GW_Lane5_STvi3 ST_vi vir vir 24.0 768162
193 │ GW_Lane5_TA1 GW_Lane5_TA1 TA plumb plumb 86.0 711909
194 │ GW_Lane5_TL1 GW_Lane5_TL1 TL vir vir 9.0 743509
195 │ GW_Lane5_TL10 GW_Lane5_TL10 TL vir vir 17.0 669934
196 │ GW_Lane5_TL11 GW_Lane5_TL11 TL vir vir 18.0 638402
197 │ GW_Lane5_TL12 GW_Lane5_TL12 TL vir vir 19.0 585697
198 │ GW_Lane5_TL2 GW_Lane5_TL2 TL vir vir 10.0 770857
199 │ GW_Lane5_TL4 GW_Lane5_TL4 TL vir vir 12.0 758037
200 │ GW_Lane5_TL5 GW_Lane5_TL5 TL vir vir 13.0 867165
201 │ GW_Lane5_TL7 GW_Lane5_TL7 TL vir vir 14.0 803407
202 │ GW_Lane5_TL8 GW_Lane5_TL8 TL vir vir 15.0 698745
203 │ GW_Lane5_TL9 GW_Lane5_TL9 TL vir vir 16.0 606969
204 │ GW_Lane5_TU1 GW_Lane5_TU1 TU nit nit 35.0 793640
205 │ GW_Lane5_TU2 GW_Lane5_TU2 TU nit nit 36.0 736785
206 │ GW_Lane5_UY2 GW_Lane5_UY2 UY plumb plumb 88.0 729002
207 │ GW_Lane5_UY3 GW_Lane5_UY3 UY plumb plumb 89.0 677527
208 │ GW_Lane5_UY4 GW_Lane5_UY4 UY plumb plumb 90.0 749848
209 │ GW_Lane5_UY5 GW_Lane5_UY5 UY plumb plumb 91.0 718373
210 │ GW_Lane5_UY6 GW_Lane5_UY6 UY plumb plumb 92.0 713675
211 │ GW_Lane5_YK1 GW_Lane5_YK1 YK vir vir 1.0 831245
212 │ GW_Lane5_YK11 GW_Lane5_YK11 YK vir vir 8.0 730798
213 │ GW_Lane5_YK3 GW_Lane5_YK3 YK vir vir 2.0 731944
214 │ GW_Lane5_YK4 GW_Lane5_YK4 YK vir vir 3.0 740051
215 │ GW_Lane5_YK5 GW_Lane5_YK5 YK vir vir 4.0 738740
216 │ GW_Lane5_YK6 GW_Lane5_YK6 YK vir vir 5.0 697420
217 │ GW_Lane5_YK7 GW_Lane5_YK7 YK vir vir 6.0 692052
218 │ GW_Lane5_YK9 GW_Lane5_YK9 YK vir vir 7.0 768722
219 │ GW_Liz_GBS_Liz10045 GW_Liz_GBS_Liz10045 ML lud lud_ML 51.01 818407
220 │ GW_Liz_GBS_Liz10094 GW_Liz_GBS_Liz10094 ML lud lud_ML 51.02 905297
221 │ GW_Liz_GBS_Liz5144 GW_Liz_GBS_Liz5144 ML lud lud_ML 51.08 942026
222 │ GW_Liz_GBS_Liz5163 GW_Liz_GBS_Liz5163 ML lud_chick lud_ML 51.12 851723
223 │ GW_Liz_GBS_Liz5164 GW_Liz_GBS_Liz5164 ML lud_chick lud_ML 51.13 866612
224 │ GW_Liz_GBS_Liz5165 GW_Liz_GBS_Liz5165 ML lud lud_ML 51.14 939782
225 │ GW_Liz_GBS_Liz5167 GW_Liz_GBS_Liz5167 ML lud_chick lud_ML 51.15 932442
226 │ GW_Liz_GBS_Liz5168 GW_Liz_GBS_Liz5168 ML lud_chick lud_ML 51.16 944373
227 │ GW_Liz_GBS_Liz5173 GW_Liz_GBS_Liz5173 ML lud_chick lud_ML 51.2 961268
228 │ GW_Liz_GBS_Liz5175 GW_Liz_GBS_Liz5175 ML lud lud_ML 51.22 947557
229 │ GW_Liz_GBS_Liz5178 GW_Liz_GBS_Liz5178 ML lud_chick lud_ML 51.25 942983
230 │ GW_Liz_GBS_Liz5179 GW_Liz_GBS_Liz5179 ML lud_chick lud_ML 51.26 933796
231 │ GW_Liz_GBS_Liz5182 GW_Liz_GBS_Liz5182 ML lud_chick lud_ML 51.28 950069
232 │ GW_Liz_GBS_Liz5184 GW_Liz_GBS_Liz5184 ML lud_chick lud_ML 51.29 875068
233 │ GW_Liz_GBS_Liz5185 GW_Liz_GBS_Liz5185 ML lud lud_ML 51.3 948876
234 │ GW_Liz_GBS_Liz5188 GW_Liz_GBS_Liz5188 ML lud lud_ML 51.33 853346
235 │ GW_Liz_GBS_Liz5189 GW_Liz_GBS_Liz5189 ML lud_chick lud_ML 51.34 972180
236 │ GW_Liz_GBS_Liz5190 GW_Liz_GBS_Liz5190 ML lud_chick lud_ML 51.35 922566
237 │ GW_Liz_GBS_Liz5191 GW_Liz_GBS_Liz5191 ML lud_chick lud_ML 51.36 836413
238 │ GW_Liz_GBS_Liz5193 GW_Liz_GBS_Liz5193 ML lud_chick lud_ML 51.38 895586
239 │ GW_Liz_GBS_Liz5194 GW_Liz_GBS_Liz5194 ML lud_chick lud_ML 51.39 900099
240 │ GW_Liz_GBS_Liz5197 GW_Liz_GBS_Liz5197 ML lud lud_ML 51.41 808093
241 │ GW_Liz_GBS_Liz5199 GW_Liz_GBS_Liz5199 ML lud_chick lud_ML 51.42 953161
242 │ GW_Liz_GBS_Liz6002 GW_Liz_GBS_Liz6002 ML lud lud_ML 51.43 903717
243 │ GW_Liz_GBS_Liz6006 GW_Liz_GBS_Liz6006 ML lud lud_ML 51.44 872123
244 │ GW_Liz_GBS_Liz6008 GW_Liz_GBS_Liz6008 ML lud lud_ML 51.45 932032
245 │ GW_Liz_GBS_Liz6009 GW_Liz_GBS_Liz6009 ML lud lud_ML 51.46 946736
246 │ GW_Liz_GBS_Liz6010 GW_Liz_GBS_Liz6010 ML lud lud_ML 51.47 844804
247 │ GW_Liz_GBS_Liz6014 GW_Liz_GBS_Liz6014 ML lud lud_ML 51.49 828632
248 │ GW_Liz_GBS_Liz6055 GW_Liz_GBS_Liz6055 ML lud lud_ML 51.5 878978
249 │ GW_Liz_GBS_Liz6057 GW_Liz_GBS_Liz6057 ML lud lud_ML 51.51 850572
250 │ GW_Liz_GBS_Liz6060 GW_Liz_GBS_Liz6060 ML lud lud_ML 51.52 865228
251 │ GW_Liz_GBS_Liz6062 GW_Liz_GBS_Liz6062 ML lud lud_ML 51.53 834482
252 │ GW_Liz_GBS_Liz6063 GW_Liz_GBS_Liz6063 ML lud lud_ML 51.54 879233
253 │ GW_Liz_GBS_Liz6066 GW_Liz_GBS_Liz6066 ML lud lud_ML 51.55 876902
254 │ GW_Liz_GBS_Liz6072 GW_Liz_GBS_Liz6072 ML lud lud_ML 51.56 845266
255 │ GW_Liz_GBS_Liz6079 GW_Liz_GBS_Liz6079 ML lud lud_ML 51.57 820880
256 │ GW_Liz_GBS_Liz6204 GW_Liz_GBS_Liz6204 ML lud_chick lud_ML 51.59 840138
257 │ GW_Liz_GBS_Liz6461 GW_Liz_GBS_Liz6461 ML lud lud_ML 51.6 815684
258 │ GW_Liz_GBS_Liz6472 GW_Liz_GBS_Liz6472 ML lud lud_ML 51.61 907391
259 │ GW_Liz_GBS_Liz6478 GW_Liz_GBS_Liz6478 ML lud lud_ML 51.62 779125
260 │ GW_Liz_GBS_Liz6776 GW_Liz_GBS_Liz6776 ML lud lud_ML 51.64 962187
261 │ GW_Liz_GBS_Liz6794 GW_Liz_GBS_Liz6794 ML lud lud_ML 51.65 841253
Filter SNPs with too many missing genotypes:
# (remember that first column is arbitrary row number in input file)
missing_genotypes_per_SNP = sum(geno_indFiltered .== -1, dims=1)
missing_genotypes_percent_allowed_per_site = 5 # this is the percentage threshold
threshold_genotypes_missing = size(geno_indFiltered)[1] * missing_genotypes_percent_allowed_per_site/100
selection = vec(missing_genotypes_per_SNP .<= threshold_genotypes_missing)
geno_ind_SNP_filtered = geno_indFiltered[:, selection]
pos_SNP_filtered = pos_whole_genome[selection[Not(1)],:] # the Not(1) is needed because first column in geno is arbitrary row number
println("Started with ", size(geno_indFiltered, 2)-1, " SNPs.
After filtering SNPs for no more than ", missing_genotypes_percent_allowed_per_site, "% missing genotypes, ", size(geno_ind_SNP_filtered, 2)-1, " SNPs remain." )Started with 2431709 SNPs.
After filtering SNPs for no more than 5% missing genotypes, 1017581 SNPs remain.
2nd round of filtering individuals
I added this in August 2023, to improve accuracy of imputation-based PCA, because I noticed outliers tended to have more missing data. Now I only allow up to 10% missing SNPs per individual.
SNPmissing_percent_allowed_per_ind_round2 = 10 # this is the percentage threshold
threshold_missing = (size(geno_ind_SNP_filtered, 2) - 1) * SNPmissing_percent_allowed_per_ind_round2/100
numMissings = sum(geno_ind_SNP_filtered .== -1, dims=2)
selection = vec(numMissings .<= threshold_missing) # the vec command converts to BitVector rather than BitMatrix--important below
geno_ind_SNP_ind_filtered = geno_ind_SNP_filtered[selection, :]
println("Filtering out these individuals based on too many missing genotypes: ")
filtered_inds = ind_with_metadata_indFiltered.ind[selection.==false]
println(DataFrame(filtered_inds = filtered_inds)) # did this to print all lines
ind_with_metadata_indFiltered = ind_with_metadata_indFiltered[selection, :]
println("This leaves ", size(geno_ind_SNP_ind_filtered, 1), " individuals and ", size(geno_ind_SNP_ind_filtered, 2)-1, " loci,
with no individuals missing more than ", SNPmissing_percent_allowed_per_ind_round2, "% of genotypes
and no loci missing in more than ", missing_genotypes_percent_allowed_per_site, "% of individuals.")Filtering out these individuals based on too many missing genotypes:
4×1 DataFrame
Row │ filtered_inds
│ String
─────┼──────────────────────────
1 │ GW_Armando_plate1_TTGW74
2 │ GW_Armando_plate2_TTGW54
3 │ GW_Lane5_AA8
4 │ GW_Lane5_YK1
This leaves 257 individuals and 1017581 loci,
with no individuals missing more than 10% of genotypes
and no loci missing in more than 5% of individuals.
Remove the first column of the genotype matrix (which was an initial row number):
genosOnly_ind_SNP_ind_filtered = geno_ind_SNP_ind_filtered[:, Not(1)]257×1017581 Matrix{Int16}:
0 0 0 0 0 1 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
⋮ ⋮ ⋮ ⋱ ⋮ ⋮ ⋮
0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 -1 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Polish a few individual names (for more readable graphs):
ind_with_metadata_indFiltered.ind = correctNames(ind_with_metadata_indFiltered.ind)
ind_with_metadata_indFiltered.ID = correctNames(ind_with_metadata_indFiltered.ID)257-element Vector{String}:
"GW_Armando_plate1_AB1"
"GW_Armando_plate1_JF07G02"
"GW_Armando_plate1_JF07G03"
"GW_Armando_plate1_JF07G04"
"GW_Armando_plate1_JF08G02"
"GW_Armando_plate1_JF09G01"
"GW_Armando_plate1_JF09G02"
"GW_Armando_plate1_JF10G03"
"GW_Armando_plate1_JF11G01"
"GW_Armando_plate1_JF12G01"
"GW_Armando_plate1_JF12G02"
"GW_Armando_plate1_JF12G04"
"GW_Armando_plate1_JF13G01"
⋮
"GW_Liz_GBS_Liz6060"
"GW_Liz_GBS_Liz6062"
"GW_Liz_GBS_Liz6063"
"GW_Liz_GBS_Liz6066"
"GW_Liz_GBS_Liz6072"
"GW_Liz_GBS_Liz6079"
"GW_Liz_GBS_Liz6204"
"GW_Liz_GBS_Liz6461"
"GW_Liz_GBS_Liz6472"
"GW_Liz_GBS_Liz6478"
"GW_Liz_GBS_Liz6776"
"GW_Liz_GBS_Liz6794"
Calculate distances around ring
The locations around the ring (assuming barrier in North) can be graphed against genomic PC1 (or other variables).
Load lat/long data
cd(repoDirectory)
latlong_filepath = "metadata/GW_locations_LatLong_2023.txt"
latlongs = DataFrame(CSV.File(latlong_filepath))
print(latlongs)37×5 DataFrame
Row │ Location_name location_short lat_N long_E subspecies
│ String31 String7 Float64 Float64 String15
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ Yekaterinburg YK 56.8 60.6 viridanus
2 │ Abakan AB 52.0 89.5 viridanus
3 │ Teletsk TL 51.7 87.6 viridanus
4 │ Verkh_Biryusa_vir VB_vi 55.9 92.0 viridanus
5 │ Divnogorsk_vir DV_vi 56.0 92.4 viridanus
6 │ Stolbi_vir ST_vi 55.9 92.6 viridanus
7 │ Turkey TU 41.0 42.0 nitidus
8 │ Ala_Archa AA 42.54 74.5 viridanus
9 │ Naran_Pakistan NR 34.884 73.691 ludlowi
10 │ Shogran_Pakistan SH 34.594 73.466 ludlowi
11 │ Overa_Kashmir OV 33.991 75.243 ludlowi
12 │ Satharundhi_ChambaDistrict SA 32.974 76.222 ludlowi
13 │ KL_Killar_HP KL 33.106 76.409 ludlowi
14 │ Thalighar TH 32.828 76.45 ludlowi
15 │ Sural SR 33.134 76.455 ludlowi
16 │ PA_Tindi_HP PA 32.771 76.472 ludlowi
17 │ Sukhto SU 32.868 76.855 ludlowi
18 │ Nainaghar NG 32.728 76.8594 ludlowi
19 │ Mooling_and_Keylong ML 32.508 76.981 ludlowi
20 │ Manali MN 32.237 77.13 ludlowi
21 │ Spiti SP 32.377 77.281 ludlowi
22 │ Langtang LN 28.25 85.5 trochiloides
23 │ Gongga GG 29.5 102.0 obscuratus
24 │ Emeishan EM 29.5 103.3 obscuratus
25 │ Xining XN 37.0 102.0 obscuratus
26 │ Beijing BJ 40.0 115.5 plumbeitarsus
27 │ Baikal BK 51.9 104.9 plumbeitarsus
28 │ Arshan AN 51.9 102.5 plumbeitarsus
29 │ Ilinka IL 51.1 95.5 plumbeitarsus
30 │ Tuva TA 51.3 92.0 plumbeitarsus
31 │ Uyukski UY 51.9 94.1 plumbeitarsus
32 │ Manskoe_Belogorie MB 54.7 94.0 plumbeitarsus
33 │ Verkh_Biryusa VB 55.9 92.0 plumbeitarsus
34 │ Divnogorsk DV 56.0 92.4 plumbeitarsus
35 │ Stolbi ST 55.9 92.6 plumbeitarsus
36 │ Predivinsk PR 57.1 93.5 plumbeitarsus
37 │ Solgonski SL 55.7 91.0 plumbeitarsus
Make a quick plot to inspect latlong data:
f = CairoMakie.Figure()
ax = Axis(f[1, 1],
title = "Research locations",
xlabel = "longitude (E)",
ylabel = "latitude (N)"
)
scatter!(latlongs.long_E, latlongs.lat_N)
text!(latlongs.long_E, latlongs.lat_N; text = latlongs.location_short)
f┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
remove “Green warbler” nitidus
Phylloscopus [t.] nitidus is outside of the main ring, so remove these samples from this analysis:
latlongs2 = latlongs[Not(latlongs.subspecies .== "nitidus"), :];
print(latlongs2)36×5 DataFrame
Row │ Location_name location_short lat_N long_E subspecies
│ String31 String7 Float64 Float64 String15
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ Yekaterinburg YK 56.8 60.6 viridanus
2 │ Abakan AB 52.0 89.5 viridanus
3 │ Teletsk TL 51.7 87.6 viridanus
4 │ Verkh_Biryusa_vir VB_vi 55.9 92.0 viridanus
5 │ Divnogorsk_vir DV_vi 56.0 92.4 viridanus
6 │ Stolbi_vir ST_vi 55.9 92.6 viridanus
7 │ Ala_Archa AA 42.54 74.5 viridanus
8 │ Naran_Pakistan NR 34.884 73.691 ludlowi
9 │ Shogran_Pakistan SH 34.594 73.466 ludlowi
10 │ Overa_Kashmir OV 33.991 75.243 ludlowi
11 │ Satharundhi_ChambaDistrict SA 32.974 76.222 ludlowi
12 │ KL_Killar_HP KL 33.106 76.409 ludlowi
13 │ Thalighar TH 32.828 76.45 ludlowi
14 │ Sural SR 33.134 76.455 ludlowi
15 │ PA_Tindi_HP PA 32.771 76.472 ludlowi
16 │ Sukhto SU 32.868 76.855 ludlowi
17 │ Nainaghar NG 32.728 76.8594 ludlowi
18 │ Mooling_and_Keylong ML 32.508 76.981 ludlowi
19 │ Manali MN 32.237 77.13 ludlowi
20 │ Spiti SP 32.377 77.281 ludlowi
21 │ Langtang LN 28.25 85.5 trochiloides
22 │ Gongga GG 29.5 102.0 obscuratus
23 │ Emeishan EM 29.5 103.3 obscuratus
24 │ Xining XN 37.0 102.0 obscuratus
25 │ Beijing BJ 40.0 115.5 plumbeitarsus
26 │ Baikal BK 51.9 104.9 plumbeitarsus
27 │ Arshan AN 51.9 102.5 plumbeitarsus
28 │ Ilinka IL 51.1 95.5 plumbeitarsus
29 │ Tuva TA 51.3 92.0 plumbeitarsus
30 │ Uyukski UY 51.9 94.1 plumbeitarsus
31 │ Manskoe_Belogorie MB 54.7 94.0 plumbeitarsus
32 │ Verkh_Biryusa VB 55.9 92.0 plumbeitarsus
33 │ Divnogorsk DV 56.0 92.4 plumbeitarsus
34 │ Stolbi ST 55.9 92.6 plumbeitarsus
35 │ Predivinsk PR 57.1 93.5 plumbeitarsus
36 │ Solgonski SL 55.7 91.0 plumbeitarsus
Make a matrix of great circle distances
These are Haversine distances, assuming spherical Earth which is really close:
geoPoints = GeoLocation.(latlongs2.long_E, latlongs2.lat_N)
# this next line is so neat--uses list comprehension to make a matrix of pairwise calculations
distances = [(HaversineDistance(geoPoints[i], geoPoints[j])/1000) for i in eachindex(geoPoints), j in eachindex(geoPoints)]36×36 Matrix{Float64}:
0.0 1929.03 1829.5 1920.28 … 1956.21 1976.01 1866.47
1929.03 0.0 134.698 463.414 478.643 622.753 422.996
1829.5 134.698 0.0 548.935 570.583 711.032 497.776
1920.28 463.414 548.935 0.0 37.404 162.101 66.3387
1941.17 483.376 572.19 27.2735 16.6941 139.661 93.5371
1956.21 478.643 570.583 37.404 … 0.0 144.411 102.442
1866.88 1539.57 1417.12 1943.75 1971.61 2101.68 1883.0
2629.03 2280.66 2174.21 2720.63 2744.17 2882.26 2664.81
2653.46 2318.81 2212.24 2758.56 2782.15 2920.16 2702.67
2768.57 2304.58 2204.89 2753.66 2775.3 2915.75 2700.28
2905.38 2370.69 2276.33 2825.06 … 2845.24 2987.08 2773.58
2897.44 2350.4 2256.5 2805.15 2825.21 2967.15 2753.85
2927.82 2377.48 2284.21 2832.75 2852.64 2994.71 2781.67
⋮ ⋱ ⋮
4317.27 2390.71 2499.26 2464.31 2434.21 2474.86 2502.82
2869.29 1053.52 1187.02 953.003 … 918.603 934.047 1003.57
2724.34 889.835 1023.03 817.872 785.302 818.533 863.881
2341.58 426.63 551.809 581.592 567.027 679.716 591.859
2117.8 189.217 307.752 511.497 513.021 652.226 493.694
2214.04 315.404 447.369 465.516 455.478 579.504 468.914
2082.09 423.338 540.944 183.922 … 160.175 268.68 220.448
1920.28 463.414 548.935 0.0 37.404 162.101 66.3387
1941.17 483.376 572.19 27.2735 16.6941 139.661 93.5371
1956.21 478.643 570.583 37.404 0.0 144.411 102.442
1976.01 622.753 711.032 162.101 144.411 0.0 218.833
1866.47 422.996 497.776 66.3387 … 102.442 218.833 0.0
Now adjust distances to assume no gene flow through centre of ring.
# get some key distances
function getIndex(name; nameVector = latlongs2.Location_name)
findfirst(isequal(name), nameVector)
end
index_AA = getIndex("Ala_Archa")
index_NR = getIndex("Naran_Pakistan")
index_LN = getIndex("Langtang")
index_GG = getIndex("Gongga")
index_XN = getIndex("Xining")
index_BJ = getIndex("Beijing")
index_last = nrow(latlongs2)
dist_NR_to_LN = distances[index_NR, index_LN]
dist_LN_to_GG = distances[index_LN, index_GG]
dist_GG_to_BJ = distances[index_GG, index_BJ]
# This next part will assume locations in the input file are arranged in order around ring:
distsAroundRing = Matrix{Float32}(undef, size(distances)[1], size(distances)[2])
# accept all distances within viridanus:
# function for accepting straight-line great circle dists as distances between sets of sites
acceptDists = function(straightGreatCircleDists, start, finish, distsAroundRing)
distsAroundRing[start:finish, start:finish] = straightGreatCircleDists[start:finish, start:finish]
return(distsAroundRing)
end
# accept all distances within viridanus:
distsAroundRing = acceptDists(distances, 1, index_AA, distsAroundRing)
# accept dist from AA to NR:
distsAroundRing = acceptDists(distances, index_AA, index_NR, distsAroundRing)
# accept all distances from NR to LN:
distsAroundRing = acceptDists(distances, index_NR, index_LN, distsAroundRing)
# accept dist from LN to GG:
distsAroundRing = acceptDists(distances, index_LN, index_GG, distsAroundRing)
# accept dists between GG, EM, XN, BJ:
distsAroundRing = acceptDists(distances, index_GG, index_BJ, distsAroundRing)
# accept all distances within plumbeitarsus:
distsAroundRing = acceptDists(distances, index_BJ, index_last, distsAroundRing)
# function for adding up distances measured through certain sites:
addDists = function(set1start, set1end, set2start, set2end, distsAroundRing)
firstDists = repeat(distsAroundRing[set1start:(set1end-1), set1end], 1, set2end-set2start+1)
secondDists = repeat(transpose(distsAroundRing[set1end, set2start:set2end]), set1end-set1start, 1)
totalDists = firstDists + secondDists
distsAroundRing[set1start:(set1end-1), set2start:set2end] = totalDists
distsAroundRing[set2start:set2end, set1start:(set1end-1)] = transpose(totalDists)
return(distsAroundRing)
end
# dists from viridanus to NR are sum of dists to AA plus AA to NR:
distsAroundRing = addDists(1, index_AA, index_NR, index_NR, distsAroundRing)
# dists from "northwest of PK" to Himalayas are sum of ringdists to NR plus NR to locations up to LN:
distsAroundRing = addDists(1, index_NR, index_NR+1, index_LN, distsAroundRing)
# dists from "west / northwest of LN" to GG are sum of dists to LN plus LN to EM:
distsAroundRing = addDists(1, index_LN, index_GG, index_GG, distsAroundRing)
# dists from "west / northwest of EM" to China are sum of dists to GG plus GG to (EM, XN, BJ):
distsAroundRing = addDists(1, index_GG, index_GG, index_BJ, distsAroundRing)
# dists from "west of BJ" to east Siberia are sum of dists to BJ plus BJ to other plumbeitarsus:
distsAroundRing = addDists(1, index_BJ, index_BJ+1, index_last, distsAroundRing);Do Principal Coordinates Analysis on the distances around the ring
This produces a single location axis around ring, going from west Siberia south, then east, then north to east Siberia.
PCO_around_ring = fit(MDS, distsAroundRing; distances=true, maxoutdim=1)
# add this as a column to the data frame:
latlongs2.LocationAroundRing = vec(-predict(PCO_around_ring))
# another way:
# latlongs2[:, :LocationAroundRing] = vec(-predict(PCO_around_ring))
latlongs2[:, [:location_short, :LocationAroundRing]]
println(latlongs2[:, [:location_short, :LocationAroundRing]])36×2 DataFrame
Row │ location_short LocationAroundRing
│ String7 Float32
─────┼────────────────────────────────────
1 │ YK -4799.38
2 │ AB -4548.43
3 │ TL -4428.78
4 │ VB_vi -4953.25
5 │ DV_vi -4978.83
6 │ ST_vi -4980.5
7 │ AA -3027.02
8 │ NR -2171.62
9 │ SH -2166.63
10 │ OV -1998.24
11 │ SA -1861.95
12 │ KL -1854.67
13 │ TH -1835.41
14 │ SR -1852.63
15 │ PA -1830.4
16 │ SU -1805.38
17 │ NG -1796.96
18 │ ML -1774.64
19 │ MN -1747.34
20 │ SP -1743.06
21 │ LN -830.651
22 │ GG 783.671
23 │ EM 890.416
24 │ XN 1481.8
25 │ BJ 2480.12
26 │ BK 4027.02
27 │ AN 4134.32
28 │ IL 4451.78
29 │ TA 4673.03
30 │ UY 4581.05
31 │ MB 4762.52
32 │ VB 4943.39
33 │ DV 4929.77
34 │ ST 4913.15
35 │ PR 4951.66
36 │ SL 4982.05
Add these ring locations to the metadata table:
ind_with_metadata_indFiltered.ring_km .= NaN # pre-allocate the column
for i in axes(latlongs2, 1)
match_indices = findall(ind_with_metadata_indFiltered.location .== latlongs2.location_short[i])
ind_with_metadata_indFiltered.ring_km[match_indices] .= latlongs2.LocationAroundRing[i];
endMake a table of locations and groups of individuals included after filtering
gdf = groupby(ind_with_metadata_indFiltered, [:location, :Fst_group])
sample_origin_summary = combine(gdf, nrow)
println(sample_origin_summary)
# to save as comma-delimited text:
# CSV.write("GW2023_sample_origin_summary.txt", sample_origin_summary; delim='\t')37×3 DataFrame
Row │ location Fst_group nrow
│ String7 String15 Int64
─────┼──────────────────────────────
1 │ AB vir 2
2 │ ST plumb 35
3 │ ST plumb_vir 1
4 │ ST_vi vir 6
5 │ DV plumb 5
6 │ DV_vi vir 2
7 │ MB plumb 4
8 │ VB plumb 5
9 │ PR plumb 5
10 │ BJ plumb_BJ 3
11 │ TL vir 11
12 │ MN troch_west 10
13 │ SU lud_central 6
14 │ TH lud_central 7
15 │ SR lud_central 18
16 │ NG lud_central 9
17 │ SP troch_west 10
18 │ SA lud_Sath 8
19 │ UY plumb 6
20 │ VB_vi vir 2
21 │ LN troch_LN 14
22 │ AA vir_S 8
23 │ AN plumb 2
24 │ BK plumb 2
25 │ XN obs 4
26 │ EM troch_EM 1
27 │ IL plumb 2
28 │ OV lud_KS 2
29 │ NR lud_PK 2
30 │ KL lud_central 3
31 │ ML lud_ML 44
32 │ PA lud_central 2
33 │ SH lud_PK 4
34 │ SL plumb 2
35 │ TA plumb 1
36 │ TU nit 2
37 │ YK vir 7
Save the filtered dataset
cd(dataDirectory)
filename = string(baseName, tagName, "ind_SNP_ind_filtered.jld2")
jldsave(filename; genosOnly_ind_SNP_ind_filtered, ind_with_metadata_indFiltered, pos_SNP_filtered)
println("Saved the filtered data.")Saved the filtered data.
HERE WILL PASS THE DATA OFF TO OTHER QUARTO PAGES, TO LOAD DATA AS BELOW:
Load the filtered dataset
filename = string(baseName, tagName, "ind_SNP_ind_filtered.jld2")
genosOnly_ind_SNP_ind_filtered = load(filename, "genosOnly_ind_SNP_ind_filtered")
ind_with_metadata_indFiltered = load(filename, "ind_with_metadata_indFiltered")
pos_SNP_filtered = load(filename, "pos_SNP_filtered")
println("Loaded the filtered data.")Prepare data for Genotype-by-individual plots and PCA
For missing genotypes, change our code of -1 to missing:
genosOnly_with_missing = Matrix{Union{Missing, Int16}}(genosOnly_ind_SNP_ind_filtered)
genosOnly_with_missing[genosOnly_with_missing .== -1] .= missing;Genotype-by-individual plots
Now, show individual genotypes for subsets of the dataset. Can choose individuals and genomic regions to plot, along with an Fst cutoff (only show SNPs with greater Fst than the cutoff).
groups = ["vir","plumb"]
plotGroups = ["vir","plumb_vir","plumb"]
plotGroupColors = ["blue","purple","red"]
numIndsToPlot = [100,100,100] # maximum number of individuals to plot from each group
group1 = "vir" # these groups will determine the color used in the graph
group2 = "plumb"
groupsToCompare = "vir_plumb"
Fst_cutoff = 0.8
missingFractionAllowed = 0.2 # only show SNPs with less than this fraction of missing data among individuals0.2
Calculate allele freqs and sample sizes (use column Fst_group)
freqs, sampleSizes = getFreqsAndSampleSizes(genosOnly_with_missing, ind_with_metadata_indFiltered.Fst_group, groups)
println("Calculated population allele frequencies and sample sizes")Calculated population allele frequencies and sample sizes
calculate Fst
Fst, FstNumerator, FstDenominator, pairwiseNamesFst = getFst(freqs, sampleSizes, groups; among=true) # set among to FALSE if no among Fst wanted (some things won't work without it)
println("Calculated Fst values")Calculated Fst values
limit the individuals to include in plot
# For this figure only, filter out individuals with lots of missing genotypes
numMissings_threshold = 800_000
selection = ind_with_metadata_indFiltered.numMissings .< numMissings_threshold
genosOnly_with_missing_selected = view(genosOnly_with_missing, selection, :)
ind_with_metadata_indFiltered_selected = view(ind_with_metadata_indFiltered, selection, :)
# now limit each group to specified numbers
genosOnly_included, ind_with_metadata_included = limitIndsToPlot(plotGroups, numIndsToPlot, genosOnly_with_missing_selected, ind_with_metadata_indFiltered_selected);Choose the scaffold and region to show
chr = "gw23"
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr]) # this gets the maximum position for the chromosome("gw23", 1, 8435933, "chr gw23 1 to 8435933")
Now actually make the plot
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff, missingFractionAllowed,
regionInfo, pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_included, ind_with_metadata_included, freqs, plotGroups, plotGroupColors; plotTitle = "");
# plotInfo contains a tuple with: (f, plottedGenotype, locations, plottedMetadata)
if false # set to true to save plot
save("Figure4_upperleft_gw23GBIplot_from_Julia.png", plotInfo[1], px_per_unit = 2.0)
end ┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
choose another chromosome, and plot similarly to above
chr = "gw26"
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr]) # this gets the maximum position for the chromosome
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff, missingFractionAllowed,
regionInfo, pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_included, ind_with_metadata_included, freqs, plotGroups, plotGroupColors; plotTitle = "");
if false # set to true to save plot
save("Figure4_upperright_gw26GBIplot_from_Julia.png", plotInfo[1], px_per_unit = 2.0)
end ┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
Make a GBI plot to illustrate variation along west side of ring
groups = ["vir","troch_LN"] # for purpose of calculating pairwise Fst and Fst_group (to determine SNPs)
plotGroups = ["vir","vir_S","nit", "lud_PK", "lud_KS", "lud_central", "lud_Sath", "lud_ML","troch_west","troch_LN"]
plotGroupColors = ["blue","turquoise1","grey","seagreen4","seagreen3","seagreen2","olivedrab3","olivedrab2","olivedrab1","yellow"]
numIndsToPlot = [10, 5, 2, 3, 5, 15, 3, 5, 10, 10] # maximum number of individuals to plot from each group
group1 = "vir" # these groups will determine the color used in the graph
group2 = "troch_LN"
groupsToCompare = "vir_troch_LN" # "Fst_among"
Fst_cutoff = 0.9
missingFractionAllowed = 0.2 # only show SNPs with less than this fraction of missing data among individuals
# Calculate allele freqs and sample sizes (use column Fst_group)
freqs, sampleSizes = getFreqsAndSampleSizes(genosOnly_with_missing, ind_with_metadata_indFiltered.Fst_group, groups)
println("Calculated population allele frequencies and sample sizes")
# calculate Fst
Fst, FstNumerator, FstDenominator, pairwiseNamesFst = getFst(freqs, sampleSizes, groups; among=true) # set among to FALSE if no among Fst wanted (some things won't work without it)
println("Calculated Fst values")
# limit the individuals to include in plot
# For this figure only, filter out individuals with lots of missing genotypes
numMissings_threshold = 800_000
selection = ind_with_metadata_indFiltered.numMissings .< numMissings_threshold
genosOnly_with_missing_selected = view(genosOnly_with_missing, selection, :)
ind_with_metadata_indFiltered_selected = view(ind_with_metadata_indFiltered, selection, :)
# now limit each group to specified numbers
genosOnly_included, ind_with_metadata_included = limitIndsToPlot(plotGroups, numIndsToPlot, genosOnly_with_missing_selected, ind_with_metadata_indFiltered_selected);
# choose the scaffold and region to show
chr = "gw26"
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr])
#### Now actually make the plot
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff, missingFractionAllowed,
regionInfo, pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_included, ind_with_metadata_included, freqs, plotGroups, plotGroupColors; plotTitle = "");
if false # set to true to save plot
save("Figure5_left_gw26GBIplotWest_from_Julia.png", plotInfo[1], px_per_unit = 2.0)
end Calculated population allele frequencies and sample sizes
Calculated Fst values
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
Show another chromosome along west side of ring:
# choose the scaffold and region to show
chr = "gw28"
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr])
#### Now actually make the plot
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff, missingFractionAllowed,
regionInfo, pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_included, ind_with_metadata_included, freqs, plotGroups, plotGroupColors; plotTitle = "");
if false # set to true to save plot
save("Figure5_right_gw28GBIplotWest_from_Julia.png", plotInfo[1], px_per_unit = 2.0)
end ┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
Make a GBI plot to illustrate variation along east side of ring
groups = ["troch_LN","obs","plumb"] # for purpose of calculating pairwise Fst and Fst_group (to determine SNPs)
plotGroups = ["troch_LN","troch_EM","obs","plumb_BJ","plumb"]
plotGroupColors = ["yellow","gold","orange","pink","red"]
numIndsToPlot = [15, 15, 15, 15, 17] # maximum number of individuals to plot from each group
group1 = "troch_LN" # these groups will determine the color used in the graph
group2 = "plumb"
groupsToCompare = "troch_LN_plumb" #"troch_LN_plumb" #"Fst_among"
Fst_cutoff = 0.9
missingFractionAllowed = 0.2 # only show SNPs with less than this fraction of missing data among individuals
# Calculate allele freqs and sample sizes (use column Fst_group)
freqs, sampleSizes = getFreqsAndSampleSizes(genosOnly_with_missing, ind_with_metadata_indFiltered.Fst_group, groups)
println("Calculated population allele frequencies and sample sizes")
# calculate Fst
Fst, FstNumerator, FstDenominator, pairwiseNamesFst = getFst(freqs, sampleSizes, groups; among=true) # set among to FALSE if no among Fst wanted (some things won't work without it)
println("Calculated Fst values")
# limit the individuals to include in plot
# For this figure only, filter out individuals with lots of missing genotypes
numMissings_threshold = 800_000
selection = ind_with_metadata_indFiltered.numMissings .< numMissings_threshold
genosOnly_with_missing_selected = view(genosOnly_with_missing, selection, :)
ind_with_metadata_indFiltered_selected = view(ind_with_metadata_indFiltered, selection, :)
# now limit each group to specified numbers
genosOnly_included, ind_with_metadata_included = limitIndsToPlot(plotGroups, numIndsToPlot, genosOnly_with_missing_selected, ind_with_metadata_indFiltered_selected);
# choose the scaffold and region to show
chr = "gw28"
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr]) # this gets the maximum position for the chromosome
# need to remove two plumbeitarsus individuals that have introgression from viridanus, as that would otherwise be misleading in this figure since viridanus are not shown
selection = Not(ind_with_metadata_included.ind .∈ Ref(["GW_Armando_plate1_JF09G01", "GW_Armando_plate1_JF24G02"]))
ind_with_metadata_included = ind_with_metadata_included[selection,:]
genosOnly_included = genosOnly_included[selection,:]
#### Now actually make the plot
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff, missingFractionAllowed,
regionInfo, pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_included, ind_with_metadata_included, freqs, plotGroups, plotGroupColors; plotTitle = "");
if false # set to true to save plot
save("Figure6_top_gw28GBIplotEast_from_Julia.png", plotInfo[1], px_per_unit = 2.0)
end Calculated population allele frequencies and sample sizes
Calculated Fst values
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
Make GBI plots showing all individuals in the study
groups_to_plot_all = ["vir","vir_S","nit", "lud_PK", "lud_KS", "lud_central", "lud_Sath", "lud_ML","troch_west","troch_LN","troch_EM","obs","plumb_BJ","plumb","plumb_vir"]
group_colors_all = ["blue","turquoise1","grey","seagreen4","seagreen3","seagreen2","olivedrab3","olivedrab2","olivedrab1","yellow","gold","orange","pink","red","purple"];This will use Fst among multiple groups to determine SNPs to plot. For chr 28:
groups = ["vir","lud_PK","troch_LN","obs","plumb"] # for purpose of calculating pairwise Fst and Fst_group (to determine SNPs)
plotGroups = groups_to_plot_all
plotGroupColors = group_colors_all
group1 = "vir" # these groups will determine the color used in the graph
group2 = "plumb"
groupsToCompare = ["vir_plumb", "vir_troch_LN", "troch_LN_plumb"] # "Fst_among" #"vir_plumb"
Fst_cutoff = 0.8
missingFractionAllowed = 0.2 # only show SNPs with less than this fraction of missing data among individuals
# Calculate allele freqs and sample sizes (use column Fst_group)
freqs, sampleSizes = getFreqsAndSampleSizes(genosOnly_with_missing, ind_with_metadata_indFiltered.Fst_group, groups)
println("Calculated population allele frequencies and sample sizes")
Fst, FstNumerator, FstDenominator, pairwiseNamesFst = getFst(freqs, sampleSizes, groups; among=true) # set among to FALSE if no among Fst wanted (some things won't work without it)
println("Calculated Fst values")
chr = "gw28"
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr]) # this gets the maximum position for the chromosome
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff,
missingFractionAllowed, regionInfo,
pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_with_missing, ind_with_metadata_indFiltered, freqs,
plotGroups, plotGroupColors;
indFontSize=4, figureSize=(1200,1600),
plotTitle = "");
# plotInfo contains a tuple with: (f, plottedGenotype, locations, plottedMetadata)Calculated population allele frequencies and sample sizes
Calculated Fst values
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
Make list of scaffolds to plot according to above:
scaffolds_to_plot = "gw" .* string.(vcat(28:-1:17, 15:-1:1))
push!(scaffolds_to_plot, "gw1A", "gw4A") # add two other scaffolds29-element Vector{String}:
"gw28"
"gw27"
"gw26"
"gw25"
"gw24"
"gw23"
"gw22"
"gw21"
"gw20"
"gw19"
"gw18"
"gw17"
"gw15"
⋮
"gw10"
"gw9"
"gw8"
"gw7"
"gw6"
"gw5"
"gw4"
"gw3"
"gw2"
"gw1"
"gw1A"
"gw4A"
Loop through each scaffold and produce GBI plot. (Making inactive because already saved figs and takes long.)
for i in 1:length(scaffolds_to_plot)
chr = scaffolds_to_plot[i]
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr]) # this gets the maximum position for the chromosome
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff,
missingFractionAllowed, regionInfo,
pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_with_missing, ind_with_metadata_indFiltered, freqs,
plotGroups, plotGroupColors;
indFontSize=4, figureSize=(1200,1600), plotTitle = "");
println("Completed the figure for ", chr, ".")
if true # set to true to save plot
filename = string("Figure_", chr, "_Fst3groups_GBI_allInds_from_Julia.png")
save(filename, plotInfo[1], px_per_unit = 2.0)
println("Saved ", filename)
end
endEstimate relationships of individuals using PCA
Our goal is to produce plots showing individuals in genotype space, using Principal Components Analysis.
Impute and save genotypes for each scaffold
PCA requires imputation of missing genotypes. I did imputation for each scaffold above a certain size threshold. These are listed in chromosomes_to_process .
Imputation can take several minutes per scaffold, so I ran this imputation step separately from this Quarto notebook (otherwise render would take long) and saved the genotype data for each scaffold for loading in the next step. Below is the code I initially used for imputing, BUT IT USES THE SVD ALGORITHM AND I LATER DECIDED THAT KNN IS BETTER (BUT IN OCT2024 DECIDED SVD IS BEST HA HA, but it takes a lot longer–now changing my mind back, that K=1 and KNN is best because nitidus has only 2 inds):
julia # DON'T RUN THIS AS IT USES SVD IMPUTING (NOT KNN--SEE BELOW FOR THAT) #= for i in eachindex(chromosomes_to_process) chrom = chromosomes_to_process[i] regionText = string("chr", chrom) loci_selection = (pos_SNP_filtered.chrom .== chrom) pos_SNP_filtered_region = pos_SNP_filtered[loci_selection,:] genosOnly_region_for_imputing = Matrix{Union{Missing, Float32}}(genosOnly_with_missing[:,loci_selection]) @time imputed_genos = Impute.svd(genosOnly_region_for_imputing) filename = string(baseName, tagName, regionText, ".SVDimputedMissing.jld2") jldsave(filename; imputed_genos, ind_with_metadata_indFiltered, pos_SNP_filtered_region) println(string("Chromosome ", chrom, ": Saved real and imputed genotypes for ", size(pos_SNP_filtered_region, 1)," SNPs and ", size(genosOnly_region_for_imputing, 1)," filtered individuals.")) end =#
Now we can cycle through a set of chromosomes and plot a PCA for each. We need to first specify some groups to include in the plot, and their colors:
groups_to_plot_PCA = ["vir","vir_S","nit", "lud_PK", "lud_KS", "lud_central", "lud_Sath", "lud_ML","troch_west","troch_LN","troch_EM","obs","plumb_BJ","plumb","plumb_vir"]
group_colors_PCA = ["blue","turquoise1","grey","seagreen4","seagreen3","seagreen2","olivedrab3","olivedrab2","olivedrab1","yellow","gold","orange","pink","red","purple"];Now we’ll actually do the PCA and make the plot for each scaffold. The code block below is what I initially used, based on the SVD method of imputing. I’ve now decided that KNN is better (see further below)–CHANGED MIND AGAIN 1NOV2024–changed back to KNN and K=1 in Jan2025.
MAKING INACTIVE FOR NOW, BECAUSE MAKES A LOT OF PLOTS:
#= for i in eachindex(chromosomes_to_process)
chrom = chromosomes_to_process[i]
regionText = string("chr", chrom)
filename = string(baseName, tagName, regionText, ".SVDimputedMissing.jld2")
imputed_genos = load(filename, "imputed_genos")
ind_with_metadata_indFiltered = load(filename, "ind_with_metadata_indFiltered")
pos_SNP_filtered_region = load(filename, "pos_SNP_filtered_region")
println(string("Loaded ", filename))
println(string(regionText, ": ", size(imputed_genos, 2), " SNPs from ", size(imputed_genos, 1), " individuals"))
flipPC1 = true
flipPC2 = true
PCAmodel = plotPCA(imputed_genos, ind_with_metadata_indFiltered,
groups_to_plot_PCA, group_colors_PCA;
sampleSet="greenish warblers", regionText=regionText,
flip1=flipPC1, flip2=flipPC2,
showPlot=false)
# add position of reference genome
refGenomePCAposition = predict(PCAmodel.model, zeros(size(imputed_genos, 2)))
flipPC1 && (refGenomePCAposition[1] *= -1) # this flips PC1 if flipPC1 = true
flipPC2 && (refGenomePCAposition[2] *= -1) # same for PC2
CairoMakie.scatter!(refGenomePCAposition[1], refGenomePCAposition[2], marker=:diamond, color="black", markersize=15, strokewidth=0.5)
try
display(PCAmodel.PCAfig)
catch
println("NOTICE: Figure for ", regionText, " could not be shown due to an unknown error.")
end
end =#Imputation using KNN
Troyanskaya et al. (2001) recommend imputation using K-nearest neighbors approach as being better than SVD, both of which are better than other methods for DNA genotyping. They also recommend using Euclidian distance. So I will try KNN with Euclidian distance, which like SVD is provided by Impute.jl. I am going with setting dims to :rows, as that seems to run much faster and produces PCAs that make a lot of sense. I’ve already run this next code cell, which does the imputing and saves the imputed data matrix for each scaffold. This actually runs quite quickly–at most a couple minutes per scaffold. :)
for i in eachindex(chromosomes_to_process)
chrom = chromosomes_to_process[i]
regionText = string("chr", chrom)
loci_selection = (pos_SNP_filtered.chrom .== chrom)
pos_SNP_filtered_region = pos_SNP_filtered[loci_selection,:]
genosOnly_region_for_imputing = Matrix{Union{Missing, Float32}}(genosOnly_with_missing[:,loci_selection])
@time imputed_genos = Impute.knn(genosOnly_region_for_imputing; dims = :rows) # much faster with the "dims = :rows" there
filename = string(baseName, tagName, regionText, ".KNNimputedMissing.jld2")
jldsave(filename; imputed_genos, ind_with_metadata_indFiltered, pos_SNP_filtered_region)
println(string("Chromosome ", chrom, ": Saved real and imputed genotypes for ", size(pos_SNP_filtered_region, 1)," SNPs and ", size(genosOnly_region_for_imputing, 1)," filtered individuals."))
endNow do the KNN PCA:
MAKING INACTIVE FOR NOW, BECAUSE MAKES A LOT OF PLOTS:
for i in eachindex(chromosomes_to_process)
scaffold = chromosomes_to_process[i]
regionText = string("chr", scaffold)
filename = string(baseName, tagName, regionText, ".KNNimputedMissing.jld2")
imputed_genos = load(filename, "imputed_genos")
ind_with_metadata_indFiltered = load(filename, "ind_with_metadata_indFiltered")
pos_SNP_filtered_region = load(filename, "pos_SNP_filtered_region")
println(string("Loaded ",filename))
println(string(regionText, ": ", size(imputed_genos,2), " SNPs from ", size(imputed_genos,1), " individuals"))
flipPC1 = true
flipPC2 = true
PCAmodel = plotPCA(imputed_genos, ind_with_metadata_indFiltered,
groups_to_plot_PCA, group_colors_PCA;
sampleSet = "greenish warblers", regionText=regionText,
flip1 = flipPC1, flip2 = flipPC2,
lineOpacity = 0.7, fillOpacity = 0.6,
symbolSize = 14, showTitle = true,
xLabelText = string("Chromosome ", scaffold," PC1"), yLabelText = string("Chromosome ", scaffold," PC2"),
showPlot = false)
# add position of reference genome
refGenomePCAposition = predict(PCAmodel.model, zeros(size(imputed_genos, 2)))
flipPC1 && (refGenomePCAposition[1] *= -1) # this flips PC1 if flipPC1 = true
flipPC2 && (refGenomePCAposition[2] *= -1) # same for PC2
CairoMakie.scatter!(refGenomePCAposition[1], refGenomePCAposition[2], marker = :diamond, color="black", markersize=15, strokewidth=0.5)
try
display(PCAmodel.PCAfig)
catch
println("NOTICE: Figure for ", regionText, " could not be shown due to an unknown error.")
end
endMake GBI plots showing just west individuals:
groups = ["vir","lud_PK","troch_LN"] # for purpose of calculating pairwise Fst and Fst_group (to determine SNPs)
plotGroups = ["vir", "vir_misID", "vir_S", "nit", "lud_PK", "lud_KS", "lud_central", "lud_Sath", "lud_ML", "troch_west", "troch_LN"] # west GW individuals
plotGroupColors = ["blue", "blue", "turquoise1", "grey", "seagreen4", "seagreen3", "seagreen2", "olivedrab3", "olivedrab2", "olivedrab1", "yellow"]
group1 = "vir" # these groups will determine the color used in the graph
group2 = "troch_LN"
groupsToCompare = "Fst_among" #"vir_plumb"
Fst_cutoff = 0.8
missingFractionAllowed = 0.2 # only show SNPs with less than this fraction of missing data among individuals
# Calculate allele freqs and sample sizes (use column Fst_group)
freqs, sampleSizes = getFreqsAndSampleSizes(genosOnly_with_missing, ind_with_metadata_indFiltered.Fst_group, groups)
println("Calculated population allele frequencies and sample sizes")
Fst, FstNumerator, FstDenominator, pairwiseNamesFst = getFst(freqs, sampleSizes, groups; among=true) # set among to FALSE if no among Fst wanted (some things won't work without it)
println("Calculated Fst values")
scaffolds_to_plot = ["gw1"]
for i in 1:length(scaffolds_to_plot)
chr = scaffolds_to_plot[i]
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr]) # this gets the maximum position for the chromosome
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff,
missingFractionAllowed, regionInfo,
pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_with_missing, ind_with_metadata_indFiltered, freqs,
plotGroups, plotGroupColors;
indFontSize=4, figureSize=(1200,1600))
println("Completed the figure for ", chr, ".")
end
#for chr in scaffolds_to_plot
# println(chr)
#end
# plotInfo contains a tuple with: (f, plottedGenotype, locations, plottedMetadata)Calculated population allele frequencies and sample sizes
Calculated Fst values
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
Completed the figure for gw1.
Make GBI plots showing just east individuals:
groups = ["troch_LN","obs","plumb"] # for purpose of calculating pairwise Fst and Fst_group (to determine SNPs)
plotGroups = ["troch_LN", "troch_EM", "obs", "plumb_BJ", "plumb"] # east GW individuals
plotGroupColors = ["yellow", "gold", "orange", "pink", "red"]
group1 = "troch_LN" # these groups will determine the color used in the graph
group2 = "plumb"
groupsToCompare = "Fst_among" #"vir_plumb"
Fst_cutoff = 0.8
missingFractionAllowed = 0.2 # only show SNPs with less than this fraction of missing data among individuals
# Calculate allele freqs and sample sizes (use column Fst_group)
freqs, sampleSizes = getFreqsAndSampleSizes(genosOnly_with_missing, ind_with_metadata_indFiltered.Fst_group, groups)
println("Calculated population allele frequencies and sample sizes")
Fst, FstNumerator, FstDenominator, pairwiseNamesFst = getFst(freqs, sampleSizes, groups; among=true) # set among to FALSE if no among Fst wanted (some things won't work without it)
println("Calculated Fst values")
for i in 1:length(scaffolds_to_plot)
chr = scaffolds_to_plot[i]
regionInfo = chooseChrRegion(pos_SNP_filtered, chr; positionMin=1, positionMax=scaffold_lengths[chr]) # this gets the maximum position for the chromosome
plotInfo = plotGenotypeByIndividualWithFst(groupsToCompare, Fst_cutoff,
missingFractionAllowed, regionInfo,
pos_SNP_filtered, Fst, pairwiseNamesFst,
genosOnly_with_missing, ind_with_metadata_indFiltered, freqs,
plotGroups, plotGroupColors;
indFontSize=4, figureSize=(1200,1600))
println("Completed the figure for ", chr, ".")
end
# plotInfo contains a tuple with: (f, plottedGenotype, locations, plottedMetadata)Calculated population allele frequencies and sample sizes
Calculated Fst values
┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.
└ @ Makie ~/.julia/packages/Makie/pFPBw/src/scenes.jl:238
Completed the figure for gw1.